We all know that patents is one of the forces driving progress, I was curious how leading tech companies perform in this field. Do they innovate a lot? How innovations rate fluctuates? What is the most popular category of innovation? What are global innovation trends?
For my final project I will perform a study analysis of the patent registration data of the top 10 Technology Companies in the world trying to find answers to the following questions:
Example of web page where data was scraped:
http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=37&f=G&l=50&co1=AND&d=PTXT&s1=Apple.ASNM.&OS=AN/Apple&RS=AN/Apple
The following data was scraped for each company:
United States Patent Number
Date Field
Current U.S. Class
Patent title
Assignee name
link
Saving data to csv files. Python language was used to save data in csv files.
Importing csv file to MySql database. R language was used for the importing.
Reading data from MySql database. R language and SQL language was used for reading data from database.
Performing analysis. (R language)
Time period: 1968 till present
Number of patents regestred: 353683
Number of categories: 436
Number of companies analysed: 10
Note: Data Collection and Transformation Script is written in Python and presented in GitHub. The rest script is presented below.
library("stringr")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("tidyr")
library("RMySQL")
## Loading required package: DBI
library(DBI)
library("ggplot2")
#install.packages("sqldf")
#library("sqldf")
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(wordcloud)
## Loading required package: RColorBrewer
#declare some constants
data.dir = "/Users/Olga/Desktop/PycharmProjects/r/data/"
input.files.dir = paste0(data.dir,"patents/")
classification.path = paste0(data.dir,"classification_csv.csv")
suffix = "_us_csv.csv"
#function for connecting to MySQL database
create.connection <- function(){
connection <- dbConnect(MySQL(),
user="root", password="olga1234",
dbname="patents", host="localhost")
}
#function for creating table in db
create.table <- function(connection, schema, table){
create.table.query = paste0("CREATE TABLE ",schema,".",table,"
(
author varchar(100) not null,
patent_number varchar(20) not null,
date varchar(20) null,
year varchar(20) null,
us_cat varchar(200) null,
us_subcat varchar(200) null,
description varchar(1000) null,
title varchar(1000) null
)"
)
created.table <- dbSendStatement(connection, create.table.query)
dbClearResult(created.table)
}
#function for dropping db table
drop.table <- function(connection, schema, table){
drop.table.query = paste0("DROP TABLE IF EXISTS ",schema,".",table)
drop.table <- dbSendStatement(connection, drop.table.query)
dbClearResult(drop.table)
}
read.db <- function(table){
connection <- create.connection()
query <- paste0("SELECT * FROM ",table)
results <- dbGetQuery(connection,query)
results
}
#function for escaping single quote
clense <- function(val){
gsub("'", "''",toString(val))
}
#function for saving multiple records to db in a single transactopn
save.to.db <- function(connection,patents.df,author){
for (row in 1:nrow(patents.df)) {
patent.number <- clense(patents.df[row, "patent_number"])
date <- clense(patents.df[row, "date"])
year <- clense(patents.df[row, "year"])
us_cat <- clense(patents.df[row,"us_cat"])
us_subcat <- clense(patents.df[row,"us_subcat"])
title <- clense(patents.df[row,"title"])
description <- clense(patents.df[row,"description"])
insert.query = paste("INSERT INTO patents.PATENTS VALUES ( '",author,"','",patent.number,"', '",date,"', '",year,"', '",us_cat,"', '",us_subcat,"', '",description,"', '",title ,"')",sep="")
insert.row <- dbSendStatement(connection, insert.query)
dbClearResult(insert.row)
}
}
# function for reading csv
read_csv <- function(path,connection){
tryCatch({read.csv(file=path, fileEncoding ="cp1251", header=TRUE, quote = "\"", sep=",",stringsAsFactors=FALSE)}
, error = function(e) {
print(paste0("following file causing error: ",path))
print(e)
dbDisconnect(connection)
stop(e)
})
}
#this function does all the work with help of all functions declared above
saveall <- function(){
connection <- create.connection()
pattern = paste0(suffix,"$")
files.list <- list.files(input.files.dir,pattern = pattern)
classification.df = read_csv(classification.path, connection)
#colnames(classification.df)[1] <- "cat"
for (path in files.list){
absolute.path = paste0(input.files.dir,"/",path)
data.df <- read_csv(absolute.path, connection)
df <- left_join(data.df, classification.df, by = c('us_cat'))
author = tolower(str_replace(path, suffix,"")[[1]][1])
tryCatch({save.to.db(connection, df,author)}
, error = function(e) {
print(paste0("following file causing error: ",path))
print(e)
dbDisconnect(connection)
stop(e)
}
)
}
}
#saveall()
connection <- create.connection()
query<- "SELECT * from patents.PATENTS"
results <- dbGetQuery(connection,query)
dbDisconnect(connection)
## [1] TRUE
df<-results %>%
select (author, year, patent_number) %>%
group_by(author, year) %>%
dplyr::summarise(patent_count = n_distinct(patent_number))
df$year<- as.numeric(as.character(df$year))
overall<-ggplot(data = df, aes(x=year, y = patent_count, color = author)) +
geom_line()+ theme(axis.text.x = element_text(angle = 90, hjust =1))+scale_color_brewer(palette="Spectral")+
ggtitle ("Number Of Registered Patents By Company By Year") + xlab("year") + ylab("number of patents")
overall
Noticable leaders of innovations among top 10 companies are IBM and Samsung.
Intel and Microsoft have steady decreasing pace in patent registrations.
Facebook and Tencent innovate least of all companies.
Google has a sharp increase in patent registration since 2010 and demonstrate good innovation potential.
df1<-results %>%
select (year, patent_number) %>%
group_by(year) %>%
dplyr::summarise(patent_count = n_distinct(patent_number))
df1$year<- as.numeric(as.character(df1$year))
overall_agg<-ggplot()+
geom_line(data = df1, aes(x = year, y = patent_count), color = "blue") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle ("Number Of Registered Patents By Year (top 10 tech companies)") + xlab("year") + ylab("number of patents")
overall_agg
Graph shows that sharp increase in innovation pace starts from 1990.
Most likely following crises affected innovation pace of top 10 tech companies:
dot-com bubble crisis slowed the pace a bit, but after that steady growth have being continued.
Economic crisis 2007-2008 significantly negatively affected the innovation pace and only in 2012 innovation pace exceed the pace of pre crisis pace.
df2<-results %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=20)
## Selecting by patent_count
popular_title<-ggplot(df2, aes(x=reorder(description,-patent_count), y=patent_count)) +
geom_bar(stat="identity")+theme_minimal() +
geom_col(aes(fill = patent_count)) +
scale_fill_gradient2(low = "white", high = "blue") +
coord_flip() +
theme(text = element_text(size=6))+
ggtitle ("The most popular patents of all time among top 10 tech companies") + xlab("Total Count") + ylab("Patent Group Name")
popular_title
If I ignore Miscellaneous category as it tells a litle about patent nature, the leading patent categories are active solid-atate devices; data processing: database and file management; electrical computers and digital processing systems.
In order to admit any changes in innovation trend over time, I will look at the leading innovation categories of the following periods:
before 1990 year
from 1990 till 2000
from 2000 till 2010
2010 till present
results$year<- as.numeric(as.character(results$year))
df3<-results %>%
filter (year<1990) %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
head(df3)
## # A tibble: 6 x 2
## description patent_count
## <chr> <int>
## 1 Active solid-state devices (e.g., transistors, solid-state… 1481
## 2 Semiconductor device manufacturing: process 734
## 3 INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN REC… 724
## 4 Static information storage and retrieval 612
## 5 Dynamic magnetic information storage or retrieval 542
## 6 Error detection/correction and fault detection/recovery 521
results$year<- as.numeric(as.character(results$year))
df4<-results %>%
filter(between(year,1990,2000)) %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
head(df4)
## # A tibble: 6 x 2
## description patent_count
## <chr> <int>
## 1 Active solid-state devices (e.g., transistors, solid-state… 6110
## 2 Data processing: database and file management or data stru… 4890
## 3 Electrical computers and digital processing systems: multi… 4298
## 4 INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN REC… 3850
## 5 Electrical computers and digital processing systems: memory 3776
## 6 Error detection/correction and fault detection/recovery 3720
results$year<- as.numeric(as.character(results$year))
df5<-results %>%
filter(between(year, 2000 , 2010 )) %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
head(df5)
## # A tibble: 6 x 2
## description patent_count
## <chr> <int>
## 1 Active solid-state devices (e.g., transistors, solid-state… 17760
## 2 Data processing: database and file management or data stru… 15230
## 3 Electrical computers and digital processing systems: multi… 14619
## 4 Semiconductor device manufacturing: process 11149
## 5 Multiplex communications 9459
## 6 Miscellaneous 8591
results$year<- as.numeric(as.character(results$year))
df6<-results %>%
filter(between(year, 2010 , 2018 )) %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
head(df6)
## # A tibble: 6 x 2
## description patent_count
## <chr> <int>
## 1 Miscellaneous 80641
## 2 Recording, communication, or information retrieval equipme… 9497
## 3 Active solid-state devices (e.g., transistors, solid-state… 6048
## 4 Data processing: database and file management or data stru… 4495
## 5 Semiconductor device manufacturing: process 4357
## 6 Electrical computers and digital processing systems: multi… 4010
Main innovation categories change over time, but “Active solid state devices” is the leading patent category across all periods. Also the “evolution” of data related leading patent categories is the following: starting from data/information storage before 90th, then data processing: database and file management (1990-2020) to Recording, communication, or information retrieval equipment and image processing (from 2010 till present).
results$year<- as.numeric(as.character(results$year))
df7<-results %>%
filter(author =="apple") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
head(df7)
## # A tibble: 6 x 2
## description patent_count
## <chr> <int>
## 1 Miscellaneous 6755
## 2 Recording, communication, or information retrieval equipme… 1951
## 3 Computer graphics processing and selective visual display … 1594
## 4 Data processing: presentation processing of document, oper… 1317
## 5 Telecommunications 856
## 6 Multiplex communications 690
results$year<- as.numeric(as.character(results$year))
df8<-results %>%
filter(author =="facebook") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
head(df8)
## # A tibble: 6 x 2
## description patent_count
## <chr> <int>
## 1 Miscellaneous 1640
## 2 Electrical computers and digital processing systems: multi… 221
## 3 Data processing: database and file management or data stru… 128
## 4 Data processing: presentation processing of document, oper… 100
## 5 Data processing: financial, business practice, management,… 98
## 6 Information security 69
results$year<- as.numeric(as.character(results$year))
df9<-results %>%
filter(author =="google") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
head(df9)
## # A tibble: 6 x 2
## description patent_count
## <chr> <int>
## 1 Miscellaneous 9256
## 2 Data processing: database and file management or data stru… 2129
## 3 Electrical computers and digital processing systems: multi… 1116
## 4 Data processing: presentation processing of document, oper… 964
## 5 Data processing: financial, business practice, management,… 770
## 6 Image analysis 650
results$year<- as.numeric(as.character(results$year))
df10<-results %>%
filter(author =="ibm") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
head(df10)
## # A tibble: 6 x 2
## description patent_count
## <chr> <int>
## 1 Miscellaneous 29930
## 2 Active solid-state devices (e.g., transistors, solid-state… 12781
## 3 Data processing: database and file management or data stru… 10689
## 4 Electrical computers and digital processing systems: multi… 9930
## 5 Semiconductor device manufacturing: process 8223
## 6 Electrical computers and digital processing systems: memory 7400
results$year<- as.numeric(as.character(results$year))
df11<-results %>%
filter(author =="samsung") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
head(df11)
## # A tibble: 6 x 2
## description patent_count
## <chr> <int>
## 1 Miscellaneous 26490
## 2 Active solid-state devices (e.g., transistors, solid-state… 11322
## 3 Semiconductor device manufacturing: process 7068
## 4 Recording, communication, or information retrieval equipme… 6830
## 5 Computer graphics processing and selective visual display … 4867
## 6 Telecommunications 4640
results$year<- as.numeric(as.character(results$year))
df12<-results %>%
filter(author =="intel") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
head(df12)
## # A tibble: 6 x 2
## description patent_count
## <chr> <int>
## 1 Active solid-state devices (e.g., transistors, solid-state… 4048
## 2 Electrical computers and digital processing systems: memory 2752
## 3 Electrical computers and digital processing systems: suppo… 2634
## 4 Multiplex communications 2587
## 5 Semiconductor device manufacturing: process 2203
## 6 Electrical computers and digital data processing systems: … 2190
results$year<- as.numeric(as.character(results$year))
df13<-results %>%
filter(author =="tencent") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
head(df13)
## # A tibble: 6 x 2
## description patent_count
## <chr> <int>
## 1 Miscellaneous 861
## 2 Recording, communication, or information retrieval equipme… 165
## 3 Electrical computers and digital processing systems: multi… 57
## 4 Information security 19
## 5 Data processing: presentation processing of document, oper… 15
## 6 Data processing: database and file management or data stru… 14
results$year<- as.numeric(as.character(results$year))
df14<-results %>%
filter(author =="oracle") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
head(df14)
## # A tibble: 6 x 2
## description patent_count
## <chr> <int>
## 1 Miscellaneous 2854
## 2 Data processing: database and file management or data stru… 2183
## 3 Electrical computers and digital processing systems: multi… 1097
## 4 Data processing: software development, installation, and m… 633
## 5 Electrical computers and digital processing systems: memory 568
## 6 Data processing: presentation processing of document, oper… 446
results$year<- as.numeric(as.character(results$year))
df15<-results %>%
filter(author =="hon_hai") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
head(df15)
## # A tibble: 6 x 2
## description patent_count
## <chr> <int>
## 1 Electrical connectors 4596
## 2 Electricity: electrical systems and devices 2432
## 3 Miscellaneous 2323
## 4 Recording, communication, or information retrieval equipme… 1013
## 5 Equipment for production, distribution, or transformation … 670
## 6 Supports: cabinet structure 646
results$year<- as.numeric(as.character(results$year))
df16<-results %>%
filter(author =="microsoft") %>%
select (description,patent_number) %>%
group_by(description) %>%
dplyr::summarise(patent_count = n_distinct(patent_number)) %>%
arrange(desc(patent_count)) %>%
top_n(n=10)
## Selecting by patent_count
head(df16)
## # A tibble: 6 x 2
## description patent_count
## <chr> <int>
## 1 Miscellaneous 9038
## 2 Data processing: database and file management or data stru… 5769
## 3 Electrical computers and digital processing systems: multi… 5024
## 4 Data processing: presentation processing of document, oper… 4135
## 5 Recording, communication, or information retrieval equipme… 3567
## 6 Computer graphics processing and selective visual display … 2676
Samsung, IBM and Microsoft are innovation leaders among top 10 tech companies selected for the purpose of this project.
Intel and Microsoft have noticeable decrease in thier innovation pace, where as google demonstrate steady increasing innovation pace.
General innovation pace of all 10 companies clearly show “innovation boom” after 1990.
General innovation trend of top 10 tech companies reflects dot-com crisis as well as financial crisis 2007-2008 with significant down in patent registrations during crisis time.
Main innovation categories change over time, but “Active solid state devices” is the leading patent category across all periods. Also the “evolution” of data related leading patent categories is the following: starting from data/information storage before 90th, then data processing: database and file management (1990-2020) to Recording, communication, or information retrieval equipment and image processing (from 2010 till present).
Innovation direction of tech companies reflects the specific of their businesses: