Innovations Of Top 10 Technology Companies

Introduction

We all know that patents is one of the forces driving progress, I was curious how leading tech companies perform in this field. Do they innovate a lot? How innovations rate fluctuates? What is the most popular category of innovation? What are global innovation trends?

Goal

For my final project I will perform a study analysis of the patent registration data of the top 10 Technology Companies in the world trying to find answers to the following questions:

Data Sources

  1. Patents registration data is taken from US Patents And Trademark Office website: https://www.uspto.gov/patents-application-process/search-patents
  2. Patents US class classification: https://www.uspto.gov/web/patents/classification/selectnumwithtitle.htm
  3. Top 10 Top 10 Technology Companies (2018): https://www.investopedia.com/articles/markets/030816/worlds-top-10-technology-companies-aapl-googl.asp

Limitations and assumptions:

  1. The defenition of a ‘digital tech business’ is a business that provides a digital technical service/product/platform/hardware, or heavily relies on it, as its primary revenue source. As the majority of medium to large companies use and heavily rely on technology. I would consider company as a Technology Company if it is in the business of SELLING technology. For the purpose of the project I took the list top 10 Technology Companies by Investopedia list.

https://www.investopedia.com/articles/markets/030816/worlds-top-10-technology-companies-aapl-googl.asp

  1. IMPORTANT: As patent registrtration process takes up to 2 years to be fully registred in US Patents And Trademark Office, The data for the last 2 years is ignored in my analysis.

Technical project overview

  1. Scraping patent registration data for the whole company history. Python language was used for data scraping.

Example of web page where data was scraped:

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=37&f=G&l=50&co1=AND&d=PTXT&s1=Apple.ASNM.&OS=AN/Apple&RS=AN/Apple

The following data was scraped for each company:

  1. Data Transformation: change initial data structure to data structure that is suitable and ready to use in relational database. Python language was used to change data structure.
  1. Saving data to csv files. Python language was used to save data in csv files.

  2. Importing csv file to MySql database. R language was used for the importing.

  3. Reading data from MySql database. R language and SQL language was used for reading data from database.

  4. Performing analysis. (R language)

Data Collection Statistics Overview:

Time period: 1968 till present

Number of patents regestred: 353683

Number of categories: 436

Number of companies analysed: 10

Note: Data Collection and Transformation Script is written in Python and presented in GitHub. The rest script is presented below.

library("stringr")
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("tidyr")
library("RMySQL")
## Loading required package: DBI
library(DBI)
library("ggplot2")
#install.packages("sqldf")
#library("sqldf")
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
library(wordcloud)
## Loading required package: RColorBrewer
#declare some constants
data.dir = "/Users/Olga/Desktop/PycharmProjects/r/data/"
input.files.dir = paste0(data.dir,"patents/")
classification.path = paste0(data.dir,"classification_csv.csv")
suffix = "_us_csv.csv"

#function for connecting to MySQL database
create.connection <- function(){
  connection <- dbConnect(MySQL(),
                          user="root", password="olga1234",
                          dbname="patents", host="localhost")
}

#function for creating table in db
create.table <- function(connection, schema, table){
  create.table.query = paste0("CREATE TABLE ",schema,".",table,"
                              (
                              author        varchar(100)  not null,
                              patent_number varchar(20)   not null,
                              date          varchar(20)   null,
                              year          varchar(20)   null,
                              us_cat        varchar(200)  null,
                              us_subcat     varchar(200)  null,
                              description   varchar(1000) null,
                              title         varchar(1000) null
                              )"
                         )
  created.table <- dbSendStatement(connection, create.table.query)
  dbClearResult(created.table)
}
#function for dropping db table
drop.table <- function(connection, schema, table){
  drop.table.query = paste0("DROP TABLE IF EXISTS ",schema,".",table)
  drop.table <- dbSendStatement(connection, drop.table.query)
  dbClearResult(drop.table)
}

read.db <- function(table){
  connection <- create.connection()
  query <- paste0("SELECT * FROM ",table)
  results <- dbGetQuery(connection,query)
  results
}
#function for escaping single quote
clense <- function(val){
  gsub("'", "''",toString(val))
}

#function for saving multiple records to db in a single transactopn
save.to.db <- function(connection,patents.df,author){
  for (row in 1:nrow(patents.df)) {
    patent.number <- clense(patents.df[row, "patent_number"])
    date  <- clense(patents.df[row, "date"])
    year  <- clense(patents.df[row, "year"])
    us_cat <- clense(patents.df[row,"us_cat"])
    us_subcat <- clense(patents.df[row,"us_subcat"])
    title <- clense(patents.df[row,"title"])
    description <- clense(patents.df[row,"description"])
    insert.query = paste("INSERT INTO patents.PATENTS VALUES ( '",author,"','",patent.number,"', '",date,"', '",year,"', '",us_cat,"', '",us_subcat,"', '",description,"', '",title ,"')",sep="")
    insert.row <-  dbSendStatement(connection, insert.query)
    dbClearResult(insert.row)
  }
}
# function for reading csv
read_csv <- function(path,connection){
  tryCatch({read.csv(file=path, fileEncoding ="cp1251", header=TRUE, quote = "\"", sep=",",stringsAsFactors=FALSE)}
           , error = function(e) {
             print(paste0("following file causing error: ",path))
             print(e)
             dbDisconnect(connection)
             stop(e)
           })
}
#this function does all the work with help of all functions declared above
saveall <- function(){
  connection <- create.connection()
  pattern = paste0(suffix,"$")
  files.list <- list.files(input.files.dir,pattern = pattern)
  classification.df = read_csv(classification.path, connection)
  #colnames(classification.df)[1] <- "cat"
  for (path in files.list){
    absolute.path = paste0(input.files.dir,"/",path)
    data.df <- read_csv(absolute.path, connection)
    df <-  left_join(data.df, classification.df, by = c('us_cat'))
    author = tolower(str_replace(path, suffix,"")[[1]][1])
    tryCatch({save.to.db(connection, df,author)}
             , error = function(e) {
               print(paste0("following file causing error: ",path))
               print(e)
               dbDisconnect(connection)
               stop(e)
             }
    )
  }
}

#saveall()

Analysis

connection <- create.connection()
query<- "SELECT * from patents.PATENTS"
results <- dbGetQuery(connection,query)
dbDisconnect(connection)
## [1] TRUE

General innovation pace overview by companies by year

df<-results %>% 
  select (author, year, patent_number) %>% 
  group_by(author, year) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number))
df$year<- as.numeric(as.character(df$year))
overall<-ggplot(data = df, aes(x=year, y = patent_count, color = author)) +
         geom_line()+ theme(axis.text.x = element_text(angle = 90, hjust =1))+scale_color_brewer(palette="Spectral")+
         ggtitle ("Number Of Registered Patents By Company By Year") + xlab("year") + ylab("number of patents")
overall

Noticable leaders of innovations among top 10 companies are IBM and Samsung.

Intel and Microsoft have steady decreasing pace in patent registrations.

Facebook and Tencent innovate least of all companies.

Google has a sharp increase in patent registration since 2010 and demonstrate good innovation potential.

Aggregated innovation pace overview by year (all 10 companies)

df1<-results %>% 
  select (year, patent_number) %>% 
  group_by(year) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number))
df1$year<- as.numeric(as.character(df1$year))
overall_agg<-ggplot()+
  geom_line(data = df1, aes(x = year, y = patent_count), color = "blue") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle ("Number Of Registered Patents By Year (top 10 tech companies)") + xlab("year") + ylab("number of patents")
overall_agg

Graph shows that sharp increase in innovation pace starts from 1990.

Most likely following crises affected innovation pace of top 10 tech companies:

Evolution Of Innovations Of Top 10 companies (from 1968 till present)

In order to admit any changes in innovation trend over time, I will look at the leading innovation categories of the following periods:

Leading innovation categories before 1990 year

results$year<- as.numeric(as.character(results$year))
df3<-results %>% 
  filter (year<1990) %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)
## Selecting by patent_count
head(df3)
## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Active solid-state devices (e.g., transistors, solid-state…         1481
## 2 Semiconductor device manufacturing: process                          734
## 3 INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN REC…          724
## 4 Static information storage and retrieval                             612
## 5 Dynamic magnetic information storage or retrieval                    542
## 6 Error detection/correction and fault detection/recovery              521

Leading innovation categories from 1990 till 2000

results$year<- as.numeric(as.character(results$year))
df4<-results %>% 
  filter(between(year,1990,2000)) %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)
## Selecting by patent_count
head(df4)
## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Active solid-state devices (e.g., transistors, solid-state…         6110
## 2 Data processing: database and file management or data stru…         4890
## 3 Electrical computers and digital processing systems: multi…         4298
## 4 INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN REC…         3850
## 5 Electrical computers and digital processing systems: memory         3776
## 6 Error detection/correction and fault detection/recovery             3720

Leading innovation categories from 2000 till 2010

results$year<- as.numeric(as.character(results$year))
df5<-results %>% 
  filter(between(year, 2000 , 2010 )) %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)
## Selecting by patent_count
head(df5)
## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Active solid-state devices (e.g., transistors, solid-state…        17760
## 2 Data processing: database and file management or data stru…        15230
## 3 Electrical computers and digital processing systems: multi…        14619
## 4 Semiconductor device manufacturing: process                        11149
## 5 Multiplex communications                                            9459
## 6 Miscellaneous                                                       8591

Leading innovation categories from 2010 till 2018

results$year<- as.numeric(as.character(results$year))
df6<-results %>% 
  filter(between(year, 2010 , 2018 )) %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)
## Selecting by patent_count
head(df6)
## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Miscellaneous                                                      80641
## 2 Recording, communication, or information retrieval equipme…         9497
## 3 Active solid-state devices (e.g., transistors, solid-state…         6048
## 4 Data processing: database and file management or data stru…         4495
## 5 Semiconductor device manufacturing: process                         4357
## 6 Electrical computers and digital processing systems: multi…         4010

Main innovation categories change over time, but “Active solid state devices” is the leading patent category across all periods. Also the “evolution” of data related leading patent categories is the following: starting from data/information storage before 90th, then data processing: database and file management (1990-2020) to Recording, communication, or information retrieval equipment and image processing (from 2010 till present).

Leading innovation categories by companies

Apple

results$year<- as.numeric(as.character(results$year))
df7<-results %>% 
  filter(author =="apple") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)
## Selecting by patent_count
head(df7)
## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Miscellaneous                                                       6755
## 2 Recording, communication, or information retrieval equipme…         1951
## 3 Computer graphics processing and selective visual display …         1594
## 4 Data processing: presentation processing of document, oper…         1317
## 5 Telecommunications                                                   856
## 6 Multiplex communications                                             690

Facebook

results$year<- as.numeric(as.character(results$year))
df8<-results %>% 
  filter(author =="facebook") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)
## Selecting by patent_count
head(df8)
## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Miscellaneous                                                       1640
## 2 Electrical computers and digital processing systems: multi…          221
## 3 Data processing: database and file management or data stru…          128
## 4 Data processing: presentation processing of document, oper…          100
## 5 Data processing: financial, business practice, management,…           98
## 6 Information security                                                  69

Google

results$year<- as.numeric(as.character(results$year))
df9<-results %>% 
  filter(author =="google") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)
## Selecting by patent_count
head(df9)
## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Miscellaneous                                                       9256
## 2 Data processing: database and file management or data stru…         2129
## 3 Electrical computers and digital processing systems: multi…         1116
## 4 Data processing: presentation processing of document, oper…          964
## 5 Data processing: financial, business practice, management,…          770
## 6 Image analysis                                                       650

IBM

results$year<- as.numeric(as.character(results$year))
df10<-results %>% 
  filter(author =="ibm") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)
## Selecting by patent_count
head(df10)
## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Miscellaneous                                                      29930
## 2 Active solid-state devices (e.g., transistors, solid-state…        12781
## 3 Data processing: database and file management or data stru…        10689
## 4 Electrical computers and digital processing systems: multi…         9930
## 5 Semiconductor device manufacturing: process                         8223
## 6 Electrical computers and digital processing systems: memory         7400

Samsung

results$year<- as.numeric(as.character(results$year))
df11<-results %>% 
  filter(author =="samsung") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)
## Selecting by patent_count
head(df11)
## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Miscellaneous                                                      26490
## 2 Active solid-state devices (e.g., transistors, solid-state…        11322
## 3 Semiconductor device manufacturing: process                         7068
## 4 Recording, communication, or information retrieval equipme…         6830
## 5 Computer graphics processing and selective visual display …         4867
## 6 Telecommunications                                                  4640

Intel

results$year<- as.numeric(as.character(results$year))
df12<-results %>% 
  filter(author =="intel") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)
## Selecting by patent_count
head(df12)
## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Active solid-state devices (e.g., transistors, solid-state…         4048
## 2 Electrical computers and digital processing systems: memory         2752
## 3 Electrical computers and digital processing systems: suppo…         2634
## 4 Multiplex communications                                            2587
## 5 Semiconductor device manufacturing: process                         2203
## 6 Electrical computers and digital data processing systems: …         2190

Tencent

results$year<- as.numeric(as.character(results$year))
df13<-results %>% 
  filter(author =="tencent") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)
## Selecting by patent_count
head(df13)
## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Miscellaneous                                                        861
## 2 Recording, communication, or information retrieval equipme…          165
## 3 Electrical computers and digital processing systems: multi…           57
## 4 Information security                                                  19
## 5 Data processing: presentation processing of document, oper…           15
## 6 Data processing: database and file management or data stru…           14

Oracle

results$year<- as.numeric(as.character(results$year))
df14<-results %>% 
  filter(author =="oracle") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)
## Selecting by patent_count
head(df14)
## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Miscellaneous                                                       2854
## 2 Data processing: database and file management or data stru…         2183
## 3 Electrical computers and digital processing systems: multi…         1097
## 4 Data processing: software development, installation, and m…          633
## 5 Electrical computers and digital processing systems: memory          568
## 6 Data processing: presentation processing of document, oper…          446

Hon Hai Precision

results$year<- as.numeric(as.character(results$year))
df15<-results %>% 
  filter(author =="hon_hai") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)
## Selecting by patent_count
head(df15)
## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Electrical connectors                                               4596
## 2 Electricity: electrical systems and devices                         2432
## 3 Miscellaneous                                                       2323
## 4 Recording, communication, or information retrieval equipme…         1013
## 5 Equipment for production, distribution, or transformation …          670
## 6 Supports: cabinet structure                                          646

Microsoft

results$year<- as.numeric(as.character(results$year))
df16<-results %>% 
  filter(author =="microsoft") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)
## Selecting by patent_count
head(df16)
## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Miscellaneous                                                       9038
## 2 Data processing: database and file management or data stru…         5769
## 3 Electrical computers and digital processing systems: multi…         5024
## 4 Data processing: presentation processing of document, oper…         4135
## 5 Recording, communication, or information retrieval equipme…         3567
## 6 Computer graphics processing and selective visual display …         2676

Conclusions