We all know that patents is one of the forces driving progress, I was curious how leading tech companies perform in this field. Do they innovate a lot? How innovations rate fluctuates? What is the most popular category of innovation? What are global innovation trends?
For my final project I will perform a study analysis of the patent registration data of the top 10 Technology Companies in the world trying to find answers to the following questions:
Example of web page where data was scraped:
http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=37&f=G&l=50&co1=AND&d=PTXT&s1=Apple.ASNM.&OS=AN/Apple&RS=AN/Apple
The following data was scraped for each company:
United States Patent Number
Date Field
Current U.S. Class
Patent title
Assignee name
link
Saving data to csv files. Python language was used to save data in csv files.
Importing csv file to MySql database. R language was used for the importing.
Reading data from MySql database. R language and SQL language was used for reading data from database.
Performing analysis. (R language)
Time period: 1968 till present
Number of patents regestred: 353683
Number of categories: 436
Number of companies analysed: 10
Note: Data Collection and Transformation Script is written in Python and presented in GitHub. The rest script is presented below.
library("stringr")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("tidyr")
library("RMySQL")
## Loading required package: DBI
library(DBI)
library("ggplot2")
#install.packages("sqldf")
#library("sqldf")
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(wordcloud)
## Loading required package: RColorBrewer
#declare some constants
data.dir = "/Users/Olga/Desktop/PycharmProjects/r/data/"
input.files.dir = paste0(data.dir,"patents/")
classification.path = paste0(data.dir,"classification_csv.csv")
suffix = "_us_csv.csv"
#function for connecting to MySQL database
create.connection <- function(){
connection <- dbConnect(MySQL(),
user="root", password="olga1234",
dbname="patents", host="localhost")
}
#function for creating table in db
create.table <- function(connection, schema, table){
create.table.query = paste0("CREATE TABLE ",schema,".",table,"
(
author varchar(100) not null,
patent_number varchar(20) not null,
date varchar(20) null,
year varchar(20) null,
us_cat varchar(200) null,
us_subcat varchar(200) null,
description varchar(1000) null,
title varchar(1000) null
)"
)
created.table <- dbSendStatement(connection, create.table.query)
dbClearResult(created.table)
}
#function for dropping db table
drop.table <- function(connection, schema, table){
drop.table.query = paste0("DROP TABLE IF EXISTS ",schema,".",table)
drop.table <- dbSendStatement(connection, drop.table.query)
dbClearResult(drop.table)
}
read.db <- function(table){
connection <- create.connection()
query <- paste0("SELECT * FROM ",table)
results <- dbGetQuery(connection,query)
results
}
#function for escaping single quote
clense <- function(val){
gsub("'", "''",toString(val))
}
#function for saving multiple records to db in a single transactopn
save.to.db <- function(connection,patents.df,author){
for (row in 1:nrow(patents.df)) {
patent.number <- clense(patents.df[row, "patent_number"])
date <- clense(patents.df[row, "date"])
year <- clense(patents.df[row, "year"])
us_cat <- clense(patents.df[row,"us_cat"])
us_subcat <- clense(patents.df[row,"us_subcat"])
title <- clense(patents.df[row,"title"])
description <- clense(patents.df[row,"description"])
insert.query = paste("INSERT INTO patents.PATENTS VALUES ( '",author,"','",patent.number,"', '",date,"', '",year,"', '",us_cat,"', '",us_subcat,"', '",description,"', '",title ,"')",sep="")
insert.row <- dbSendStatement(connection, insert.query)
dbClearResult(insert.row)
}
}
# function for reading csv
read_csv <- function(path,connection){
tryCatch({read.csv(file=path, fileEncoding ="cp1251", header=TRUE, quote = "\"", sep=",",stringsAsFactors=FALSE)}
, error = function(e) {
print(paste0("following file causing error: ",path))
print(e)
dbDisconnect(connection)
stop(e)
})
}
#this function does all the work with help of all functions declared above
saveall <- function(){
connection <- create.connection()
pattern = paste0(suffix,"$")
files.list <- list.files(input.files.dir,pattern = pattern)
classification.df = read_csv(classification.path, connection)
#colnames(classification.df)[1] <- "cat"
for (path in files.list){
absolute.path = paste0(input.files.dir,"/",path)
data.df <- read_csv(absolute.path, connection)
df <- left_join(data.df, classification.df, by = c('us_cat'))
author = tolower(str_replace(path, suffix,"")[[1]][1])
tryCatch({save.to.db(connection, df,author)}
, error = function(e) {
print(paste0("following file causing error: ",path))
print(e)
dbDisconnect(connection)
stop(e)
}
)
}
}
#saveall()
connection <- create.connection()
query<- "SELECT * from patents.PATENTS"
results <- dbGetQuery(connection,query)
dbDisconnect(connection)
## [1] TRUE
df<-results %>%
select (author, year, patent_number) %>%
group_by(author, year) %>%
dplyr::summarise(patent_count = n_distinct(patent_number))
df$year<- as.numeric(as.character(df$year))
overall<-ggplot(data = df, aes(x=year, y = patent_count, color = author)) +
geom_line()+ theme(axis.text.x = element_text(angle = 90, hjust =1))+scale_color_brewer(palette="Spectral")+
ggtitle ("Number Of Registered Patents By Company By Year") + xlab("year") + ylab("number of patents")
overall
Noticable leaders of innovations among top 10 companies are IBM and Samsung.
Intel and Microsoft have steady decreasing pace in patent registrations.
Facebook and Tencent innovate least of all companies.
Google has a sharp increase in patent registration since 2010 and demonstrate good innovation potential.
df1<-results %>%
select (year, patent_number) %>%
group_by(year) %>%
dplyr::summarise(patent_count = n_distinct(patent_number))
df1$year<- as.numeric(as.character(df1$year))
overall_agg<-ggplot()+
geom_line(data = df1, aes(x = year, y = patent_count), color = "blue") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle ("Number Of Registered Patents By Year (top 10 tech companies)") + xlab("year") + ylab("number of patents")
overall_agg
Graph shows that sharp increase in innovation pace starts from 1990.
Most likely following crises affected innovation pace of top 10 tech companies:
dot-com bubble crisis slowed the pace a bit, but after that steady growth have being continued.
Economic crisis 2007-2008 significantly negatively affected the innovation pace and only in 2012 innovation pace exceed the pace of pre crisis pace.
df2<-results %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=20)
## Selecting by patent_count
popular_title<-ggplot(df2, aes(x=reorder(description,-patent_count), y=patent_count)) +
geom_bar(stat="identity")+theme_minimal() +
geom_col(aes(fill = patent_count)) +
scale_fill_gradient2(low = "white", high = "blue") +
coord_flip() +
theme(text = element_text(size=6))+
ggtitle ("The most popular patents of all time among top 10 tech companies") + xlab("Total Count") + ylab("Patent Group Name")
popular_title
If I ignore Miscellaneous category as it tells a litle about patent nature, the leading patent categories are active solid-atate devices; data processing: database and file management; electrical computers and digital processing systems.
In order to admit any changes in innovation trend over time, I will look at the leading innovation categories of the following periods:
before 1990 year
from 1990 till 2000
from 2000 till 2010
2010 till present
results$year<- as.numeric(as.character(results$year))
df3<-results %>%
filter (year<1990) %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
popular_title<-ggplot(df3, aes(x=reorder(description,-patent_count), y=patent_count)) +
geom_bar(stat="identity")+theme_minimal() +
geom_col(aes(fill = patent_count)) +
scale_fill_gradient2(low = "white", high = "dark green") +
coord_flip() +
theme(text = element_text(size=6))+
ggtitle ("Leading innovation categories before 1990 year") + xlab("Total Count") + ylab("Patent Group Name")
popular_title
results$year<- as.numeric(as.character(results$year))
df4<-results %>%
filter(between(year,1990,2000)) %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
popular_title<-ggplot(df4, aes(x=reorder(description,-patent_count), y=patent_count)) +
geom_bar(stat="identity")+theme_minimal() +
geom_col(aes(fill = patent_count)) +
scale_fill_gradient2(low = "white", high = "dark green") +
coord_flip() +
theme(text = element_text(size=6))+
ggtitle ("Leading innovation categories from 1990 till 2000") + xlab("Total Count") + ylab("Patent Group Name")
popular_title
results$year<- as.numeric(as.character(results$year))
df5<-results %>%
filter(between(year, 2000 , 2010 )) %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
popular_title<-ggplot(df5, aes(x=reorder(description,-patent_count), y=patent_count)) +
geom_bar(stat="identity")+theme_minimal() +
geom_col(aes(fill = patent_count)) +
scale_fill_gradient2(low = "white", high = "dark green") +
coord_flip() +
theme(text = element_text(size=6))+
ggtitle ("Leading innovation categories from 2000 till 2010") + xlab("Total Count") + ylab("Patent Group Name")
popular_title
results$year<- as.numeric(as.character(results$year))
df6<-results %>%
filter(between(year, 2010 , 2018 )) %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
popular_title<-ggplot(df6, aes(x=reorder(description,-patent_count), y=patent_count)) +
geom_bar(stat="identity")+theme_minimal() +
geom_col(aes(fill = patent_count)) +
scale_fill_gradient2(low = "white", high = "dark green") +
coord_flip() +
theme(text = element_text(size=6))+
ggtitle ("Leading innovation categories from 2010 till 2018") + xlab("Total Count") + ylab("Patent Group Name")
popular_title
Main innovation categories change over time, but “Active solid state devices” is the leading patent category across all periods. Also the “evolution” of data related leading patent categories is the following: starting from data/information storage before 90th, then data processing: database and file management (1990-2020) to Recording, communication, or information retrieval equipment and image processing (from 2010 till present).
results$year<- as.numeric(as.character(results$year))
df7<-results %>%
filter(author =="apple") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
popular_title<-ggplot(df7, aes(x=reorder(description,-patent_count), y=patent_count)) +
geom_bar(stat="identity")+theme_minimal() +
geom_col(aes(fill = patent_count)) +
scale_fill_gradient2(low = "white", high = "dark blue") +
coord_flip() +
theme(text = element_text(size=6))+
ggtitle ("Apple leading innovation categories") + xlab("Total Count") + ylab("Patent Group Name")
popular_title
results$year<- as.numeric(as.character(results$year))
df8<-results %>%
filter(author =="facebook") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
popular_title<-ggplot(df8, aes(x=reorder(description,-patent_count), y=patent_count)) +
geom_bar(stat="identity")+theme_minimal() +
geom_col(aes(fill = patent_count)) +
scale_fill_gradient2(low = "white", high = "dark blue") +
coord_flip() +
theme(text = element_text(size=6))+
ggtitle ("Facebook leading innovation categories") + xlab("Total Count") + ylab("Patent Group Name")
popular_title
Google ——-
results$year<- as.numeric(as.character(results$year))
df9<-results %>%
filter(author =="google") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
popular_title<-ggplot(df9, aes(x=reorder(description,-patent_count), y=patent_count)) +
geom_bar(stat="identity")+theme_minimal() +
geom_col(aes(fill = patent_count)) +
scale_fill_gradient2(low = "white", high = "dark blue") +
coord_flip() +
theme(text = element_text(size=6))+
ggtitle ("Google leading innovation categories") + xlab("Total Count") + ylab("Patent Group Name")
popular_title
results$year<- as.numeric(as.character(results$year))
df10<-results %>%
filter(author =="ibm") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
popular_title<-ggplot(df10, aes(x=reorder(description,-patent_count), y=patent_count)) +
geom_bar(stat="identity")+theme_minimal() +
geom_col(aes(fill = patent_count)) +
scale_fill_gradient2(low = "white", high = "dark blue") +
coord_flip() +
theme(text = element_text(size=6))+
ggtitle ("IBM leading innovation categories") + xlab("Total Count") + ylab("Patent Group Name")
popular_title
results$year<- as.numeric(as.character(results$year))
df11<-results %>%
filter(author =="samsung") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
popular_title<-ggplot(df11, aes(x=reorder(description,-patent_count), y=patent_count)) +
geom_bar(stat="identity")+theme_minimal() +
geom_col(aes(fill = patent_count)) +
scale_fill_gradient2(low = "white", high = "dark blue") +
coord_flip() +
theme(text = element_text(size=6))+
ggtitle ("Samsung leading innovation categories") + xlab("Total Count") + ylab("Patent Group Name")
popular_title
results$year<- as.numeric(as.character(results$year))
df12<-results %>%
filter(author =="intel") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
popular_title<-ggplot(df12, aes(x=reorder(description,-patent_count), y=patent_count)) +
geom_bar(stat="identity")+theme_minimal() +
geom_col(aes(fill = patent_count)) +
scale_fill_gradient2(low = "white", high = "dark blue") +
coord_flip() +
theme(text = element_text(size=6))+
ggtitle ("Intel leading innovation categories") + xlab("Total Count") + ylab("Patent Group Name")
popular_title
results$year<- as.numeric(as.character(results$year))
df13<-results %>%
filter(author =="tencent") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
popular_title<-ggplot(df13, aes(x=reorder(description,-patent_count), y=patent_count)) +
geom_bar(stat="identity")+theme_minimal() +
geom_col(aes(fill = patent_count)) +
scale_fill_gradient2(low = "white", high = "dark blue") +
coord_flip() +
theme(text = element_text(size=6))+
ggtitle ("Tencent leading innovation categories") + xlab("Total Count") + ylab("Patent Group Name")
popular_title
results$year<- as.numeric(as.character(results$year))
df14<-results %>%
filter(author =="oracle") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
popular_title<-ggplot(df14, aes(x=reorder(description,-patent_count), y=patent_count)) +
geom_bar(stat="identity")+theme_minimal() +
geom_col(aes(fill = patent_count)) +
scale_fill_gradient2(low = "white", high = "dark blue") +
coord_flip() +
theme(text = element_text(size=6))+
ggtitle ("Oracle leading innovation categories") + xlab("Total Count") + ylab("Patent Group Name")
popular_title
results$year<- as.numeric(as.character(results$year))
df15<-results %>%
filter(author =="hon_hai") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
popular_title<-ggplot(df15, aes(x=reorder(description,-patent_count), y=patent_count)) +
geom_bar(stat="identity")+theme_minimal() +
geom_col(aes(fill = patent_count)) +
scale_fill_gradient2(low = "white", high = "dark blue") +
coord_flip() +
theme(text = element_text(size=6))+
ggtitle ("Hon Hai Precision leading innovation categories") + xlab("Total Count") + ylab("Patent Group Name")
popular_title
results$year<- as.numeric(as.character(results$year))
df16<-results %>%
filter(author =="microsoft") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
head(df16)
## # A tibble: 6 x 2
## description patent_count
## <chr> <int>
## 1 Miscellaneous 9038
## 2 Data processing: database and file management or data stru… 5769
## 3 Electrical computers and digital processing systems: multi… 5024
## 4 Data processing: presentation processing of document, oper… 4135
## 5 Recording, communication, or information retrieval equipme… 3567
## 6 Computer graphics processing and selective visual display … 2676
popular_title<-ggplot(df16, aes(x=reorder(description,-patent_count), y=patent_count)) +
geom_bar(stat="identity")+theme_minimal() +
geom_col(aes(fill = patent_count)) +
scale_fill_gradient2(low = "white", high = "dark blue") +
coord_flip() +
theme(text = element_text(size=6))+
ggtitle ("Microsoft leading innovation categories") + xlab("Total Count") + ylab("Patent Group Name")
popular_title
Conclusions ———–
Samsung, IBM and Microsoft are innovation leaders among top 10 tech companies selected for the purpose of this project.
Intel and Microsoft have noticeable decrease in thier innovation pace, where as google demonstrate steady increasing innovation pace.
General innovation pace of all 10 companies clearly show “innovation boom” after 1990.
General innovation trend of top 10 tech companies reflects dot-com crisis as well as financial crisis 2007-2008 with significant down in patent registrations during crisis time.
Main innovation categories change over time, but “Active solid state devices” is the leading patent category across all periods. Also the “evolution” of data related leading patent categories is the following: starting from data/information storage before 90th, then data processing: database and file management (1990-2020) to Recording, communication, or information retrieval equipment and image processing (from 2010 till present).
Innovation direction of tech companies reflects the specific of their businesses: