Final Project

Innovations Of Top 10 Technology Companies

Introduction

We all know that patents is one of the forces driving progress, I was curious how leading tech companies perform in this field. Do they innovate a lot? How innovations rate fluctuates? What is the most popular category of innovation? What are global innovation trends?

Goal

For my final project I will perform a study analysis of the patent registration data of the top 10 Technology Companies in the world trying to find answers to the following questions:

what is the overall innovation trend of top 10 tech companies?
how many patents tech companies register during their entire history, who is at the top of this list?
how do companies’ main innovation categories change over time?
what are the area of main focus for innovation of each company?

Data Sources

Patents registration data is taken from US Patents And Trademark Office website: https://www.uspto.gov/patents-application-process/search-patents
Patents US class classification: https://www.uspto.gov/web/patents/classification/selectnumwithtitle.htm
Top 10 Top 10 Technology Companies (2018): https://www.investopedia.com/articles/markets/030816/worlds-top-10-technology-companies-aapl-googl.asp

Limitations and assumptions:

The defenition of a ‘digital tech business’ is a business that provides a digital technical service/product/platform/hardware, or heavily relies on it, as its primary revenue source. As the majority of medium to large companies use and heavily rely on technology. I would consider company as a Technology Company if it is in the business of SELLING technology. For the purpose of the project I took the list top 10 Technology Companies by Investopedia list.

https://www.investopedia.com/articles/markets/030816/worlds-top-10-technology-companies-aapl-googl.asp

Apple (Market value: $995.50 billion)
Samsung (Market value: $765.26 billion)
Microsoft (Market value: $855.4 billion)
Google (Alphabet) (Market value: $756.85 billion)
Intel (Market value: $220.33 billion)
IBM (Market value: $112.47 billion)
Facebook (Market value: $434.66 billion)
Hon Hai Precision (Market cap: $6023.96 billion)
Tencent (Market value: $356.11 billion)
Oracle (Market value: $190.38 billion)

IMPORTANT: As patent registrtration process takes up to 2 years to be fully registred in US Patents And Trademark Office, The data for the last 2 years is ignored in my analysis.

Technical project overview

Scraping patent registration data for the whole company history. Python language was used for data scraping.

Example of web page where data was scraped:

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=37&f=G&l=50&co1=AND&d=PTXT&s1=Apple.ASNM.&OS=AN/Apple&RS=AN/Apple

The following data was scraped for each company:

United States Patent Number
Date Field
Current U.S. Class
Patent title
Assignee name
link

Data Transformation: change initial data structure to data structure that is suitable and ready to use in relational database. Python language was used to change data structure.

current U.S. Class was splitted into 2 columns (before and after slash): us_category and us_subcategory. Only us_category was used in the analysis.
year was extracted from data column
companies name was added (“author” column)
only top 4 columns of scraped data will be used for my analysis (patent_number, date, us_cat, title)

Saving data to csv files. Python language was used to save data in csv files.
Importing csv file to MySql database. R language was used for the importing.
Reading data from MySql database. R language and SQL language was used for reading data from database.
Performing analysis. (R language)

Data Collection Statistics Overview:

Time period: 1968 till present

Number of patents regestred: 353683

Number of categories: 436

Number of companies analysed: 10

Note: Data Collection and Transformation Script is written in Python and presented in GitHub. The rest script is presented below.

library("stringr")
library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library("tidyr")
library("RMySQL")

## Loading required package: DBI

library(DBI)
library("ggplot2")
#install.packages("sqldf")
#library("sqldf")
library(plyr)

## -------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## -------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

library(wordcloud)

## Loading required package: RColorBrewer

#declare some constants
data.dir = "/Users/Olga/Desktop/PycharmProjects/r/data/"
input.files.dir = paste0(data.dir,"patents/")
classification.path = paste0(data.dir,"classification_csv.csv")
suffix = "_us_csv.csv"

#function for connecting to MySQL database
create.connection <- function(){
  connection <- dbConnect(MySQL(),
                          user="root", password="olga1234",
                          dbname="patents", host="localhost")
}

#function for creating table in db
create.table <- function(connection, schema, table){
  create.table.query = paste0("CREATE TABLE ",schema,".",table,"
                              (
                              author        varchar(100)  not null,
                              patent_number varchar(20)   not null,
                              date          varchar(20)   null,
                              year          varchar(20)   null,
                              us_cat        varchar(200)  null,
                              us_subcat     varchar(200)  null,
                              description   varchar(1000) null,
                              title         varchar(1000) null
                              )"
                         )
  created.table <- dbSendStatement(connection, create.table.query)
  dbClearResult(created.table)
}
#function for dropping db table
drop.table <- function(connection, schema, table){
  drop.table.query = paste0("DROP TABLE IF EXISTS ",schema,".",table)
  drop.table <- dbSendStatement(connection, drop.table.query)
  dbClearResult(drop.table)
}

read.db <- function(table){
  connection <- create.connection()
  query <- paste0("SELECT * FROM ",table)
  results <- dbGetQuery(connection,query)
  results
}
#function for escaping single quote
clense <- function(val){
  gsub("'", "''",toString(val))
}

#function for saving multiple records to db in a single transactopn
save.to.db <- function(connection,patents.df,author){
  for (row in 1:nrow(patents.df)) {
    patent.number <- clense(patents.df[row, "patent_number"])
    date  <- clense(patents.df[row, "date"])
    year  <- clense(patents.df[row, "year"])
    us_cat <- clense(patents.df[row,"us_cat"])
    us_subcat <- clense(patents.df[row,"us_subcat"])
    title <- clense(patents.df[row,"title"])
    description <- clense(patents.df[row,"description"])
    insert.query = paste("INSERT INTO patents.PATENTS VALUES ( '",author,"','",patent.number,"', '",date,"', '",year,"', '",us_cat,"', '",us_subcat,"', '",description,"', '",title ,"')",sep="")
    insert.row <-  dbSendStatement(connection, insert.query)
    dbClearResult(insert.row)
  }
}
# function for reading csv
read_csv <- function(path,connection){
  tryCatch({read.csv(file=path, fileEncoding ="cp1251", header=TRUE, quote = "\"", sep=",",stringsAsFactors=FALSE)}
           , error = function(e) {
             print(paste0("following file causing error: ",path))
             print(e)
             dbDisconnect(connection)
             stop(e)
           })
}
#this function does all the work with help of all functions declared above
saveall <- function(){
  connection <- create.connection()
  pattern = paste0(suffix,"$")
  files.list <- list.files(input.files.dir,pattern = pattern)
  classification.df = read_csv(classification.path, connection)
  #colnames(classification.df)[1] <- "cat"
  for (path in files.list){
    absolute.path = paste0(input.files.dir,"/",path)
    data.df <- read_csv(absolute.path, connection)
    df <-  left_join(data.df, classification.df, by = c('us_cat'))
    author = tolower(str_replace(path, suffix,"")[[1]][1])
    tryCatch({save.to.db(connection, df,author)}
             , error = function(e) {
               print(paste0("following file causing error: ",path))
               print(e)
               dbDisconnect(connection)
               stop(e)
             }
    )
  }
}

#saveall()

Analysis

connection <- create.connection()
query<- "SELECT * from patents.PATENTS"
results <- dbGetQuery(connection,query)
dbDisconnect(connection)

## [1] TRUE

General innovation pace overview by companies by year

df<-results %>% 
  select (author, year, patent_number) %>% 
  group_by(author, year) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number))

df$year<- as.numeric(as.character(df$year))
overall<-ggplot(data = df, aes(x=year, y = patent_count, color = author)) +
         geom_line()+ theme(axis.text.x = element_text(angle = 90, hjust =1))+scale_color_brewer(palette="Spectral")+
         ggtitle ("Number Of Registered Patents By Company By Year") + xlab("year") + ylab("number of patents")
overall

Noticable leaders of innovations among top 10 companies are IBM and Samsung.

Intel and Microsoft have steady decreasing pace in patent registrations.

Facebook and Tencent innovate least of all companies.

Google has a sharp increase in patent registration since 2010 and demonstrate good innovation potential.

Aggregated innovation pace overview by year (all 10 companies)

df1<-results %>% 
  select (year, patent_number) %>% 
  group_by(year) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number))

df1$year<- as.numeric(as.character(df1$year))
overall_agg<-ggplot()+
  geom_line(data = df1, aes(x = year, y = patent_count), color = "blue") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle ("Number Of Registered Patents By Year (top 10 tech companies)") + xlab("year") + ylab("number of patents")
overall_agg

Graph shows that sharp increase in innovation pace starts from 1990.

Most likely following crises affected innovation pace of top 10 tech companies:

dot-com bubble crisis slowed the pace a bit, but after that steady growth have being continued.
Economic crisis 2007-2008 significantly negatively affected the innovation pace and only in 2012 innovation pace exceed the pace of pre crisis pace.

What do tech companies innovate most (top 20 most popular patent categories)?

df2<-results %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=20)

## Selecting by patent_count

popular_title<-ggplot(df2, aes(x=reorder(description,-patent_count), y=patent_count)) +
  geom_bar(stat="identity")+theme_minimal() +
  geom_col(aes(fill = patent_count)) + 
  scale_fill_gradient2(low = "white", high = "blue") + 
  coord_flip() + 
  theme(text = element_text(size=6))+
  ggtitle ("The most popular patents of all time among top 10 tech companies") + xlab("Total Count") + ylab("Patent Group Name")

popular_title

If I ignore Miscellaneous category as it tells a litle about patent nature, the leading patent categories are active solid-atate devices; data processing: database and file management; electrical computers and digital processing systems.

Evolution Of Innovations Of Top 10 companies (from 1968 till present)

In order to admit any changes in innovation trend over time, I will look at the leading innovation categories of the following periods:

before 1990 year
from 1990 till 2000
from 2000 till 2010
2010 till present

Leading innovation categories before 1990 year

results$year<- as.numeric(as.character(results$year))
df3<-results %>% 
  filter (year<1990) %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)

## Selecting by patent_count

head(df3)

## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Active solid-state devices (e.g., transistors, solid-state…         1481
## 2 Semiconductor device manufacturing: process                          734
## 3 INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN REC…          724
## 4 Static information storage and retrieval                             612
## 5 Dynamic magnetic information storage or retrieval                    542
## 6 Error detection/correction and fault detection/recovery              521

Leading innovation categories from 1990 till 2000

results$year<- as.numeric(as.character(results$year))
df4<-results %>% 
  filter(between(year,1990,2000)) %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)

## Selecting by patent_count

head(df4)

## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Active solid-state devices (e.g., transistors, solid-state…         6110
## 2 Data processing: database and file management or data stru…         4890
## 3 Electrical computers and digital processing systems: multi…         4298
## 4 INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN REC…         3850
## 5 Electrical computers and digital processing systems: memory         3776
## 6 Error detection/correction and fault detection/recovery             3720

Leading innovation categories from 2000 till 2010

results$year<- as.numeric(as.character(results$year))
df5<-results %>% 
  filter(between(year, 2000 , 2010 )) %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)

## Selecting by patent_count

head(df5)

## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Active solid-state devices (e.g., transistors, solid-state…        17760
## 2 Data processing: database and file management or data stru…        15230
## 3 Electrical computers and digital processing systems: multi…        14619
## 4 Semiconductor device manufacturing: process                        11149
## 5 Multiplex communications                                            9459
## 6 Miscellaneous                                                       8591

Leading innovation categories from 2010 till 2018

results$year<- as.numeric(as.character(results$year))
df6<-results %>% 
  filter(between(year, 2010 , 2018 )) %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)

## Selecting by patent_count

head(df6)

## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Miscellaneous                                                      80641
## 2 Recording, communication, or information retrieval equipme…         9497
## 3 Active solid-state devices (e.g., transistors, solid-state…         6048
## 4 Data processing: database and file management or data stru…         4495
## 5 Semiconductor device manufacturing: process                         4357
## 6 Electrical computers and digital processing systems: multi…         4010

Main innovation categories change over time, but “Active solid state devices” is the leading patent category across all periods. Also the “evolution” of data related leading patent categories is the following: starting from data/information storage before 90th, then data processing: database and file management (1990-2020) to Recording, communication, or information retrieval equipment and image processing (from 2010 till present).

Leading innovation categories by companies

Apple

results$year<- as.numeric(as.character(results$year))
df7<-results %>% 
  filter(author =="apple") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)

## Selecting by patent_count

head(df7)

## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Miscellaneous                                                       6755
## 2 Recording, communication, or information retrieval equipme…         1951
## 3 Computer graphics processing and selective visual display …         1594
## 4 Data processing: presentation processing of document, oper…         1317
## 5 Telecommunications                                                   856
## 6 Multiplex communications                                             690

Facebook

results$year<- as.numeric(as.character(results$year))
df8<-results %>% 
  filter(author =="facebook") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)

## Selecting by patent_count

head(df8)

## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Miscellaneous                                                       1640
## 2 Electrical computers and digital processing systems: multi…          221
## 3 Data processing: database and file management or data stru…          128
## 4 Data processing: presentation processing of document, oper…          100
## 5 Data processing: financial, business practice, management,…           98
## 6 Information security                                                  69

Google

results$year<- as.numeric(as.character(results$year))
df9<-results %>% 
  filter(author =="google") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)

## Selecting by patent_count

head(df9)

## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Miscellaneous                                                       9256
## 2 Data processing: database and file management or data stru…         2129
## 3 Electrical computers and digital processing systems: multi…         1116
## 4 Data processing: presentation processing of document, oper…          964
## 5 Data processing: financial, business practice, management,…          770
## 6 Image analysis                                                       650

IBM

results$year<- as.numeric(as.character(results$year))
df10<-results %>% 
  filter(author =="ibm") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)

## Selecting by patent_count

head(df10)

## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Miscellaneous                                                      29930
## 2 Active solid-state devices (e.g., transistors, solid-state…        12781
## 3 Data processing: database and file management or data stru…        10689
## 4 Electrical computers and digital processing systems: multi…         9930
## 5 Semiconductor device manufacturing: process                         8223
## 6 Electrical computers and digital processing systems: memory         7400

Samsung

results$year<- as.numeric(as.character(results$year))
df11<-results %>% 
  filter(author =="samsung") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)

## Selecting by patent_count

head(df11)

## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Miscellaneous                                                      26490
## 2 Active solid-state devices (e.g., transistors, solid-state…        11322
## 3 Semiconductor device manufacturing: process                         7068
## 4 Recording, communication, or information retrieval equipme…         6830
## 5 Computer graphics processing and selective visual display …         4867
## 6 Telecommunications                                                  4640

Intel

results$year<- as.numeric(as.character(results$year))
df12<-results %>% 
  filter(author =="intel") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)

## Selecting by patent_count

head(df12)

## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Active solid-state devices (e.g., transistors, solid-state…         4048
## 2 Electrical computers and digital processing systems: memory         2752
## 3 Electrical computers and digital processing systems: suppo…         2634
## 4 Multiplex communications                                            2587
## 5 Semiconductor device manufacturing: process                         2203
## 6 Electrical computers and digital data processing systems: …         2190

Tencent

results$year<- as.numeric(as.character(results$year))
df13<-results %>% 
  filter(author =="tencent") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)

## Selecting by patent_count

head(df13)

## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Miscellaneous                                                        861
## 2 Recording, communication, or information retrieval equipme…          165
## 3 Electrical computers and digital processing systems: multi…           57
## 4 Information security                                                  19
## 5 Data processing: presentation processing of document, oper…           15
## 6 Data processing: database and file management or data stru…           14

Oracle

results$year<- as.numeric(as.character(results$year))
df14<-results %>% 
  filter(author =="oracle") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)

## Selecting by patent_count

head(df14)

## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Miscellaneous                                                       2854
## 2 Data processing: database and file management or data stru…         2183
## 3 Electrical computers and digital processing systems: multi…         1097
## 4 Data processing: software development, installation, and m…          633
## 5 Electrical computers and digital processing systems: memory          568
## 6 Data processing: presentation processing of document, oper…          446

Hon Hai Precision

results$year<- as.numeric(as.character(results$year))
df15<-results %>% 
  filter(author =="hon_hai") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)

## Selecting by patent_count

head(df15)

## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Electrical connectors                                               4596
## 2 Electricity: electrical systems and devices                         2432
## 3 Miscellaneous                                                       2323
## 4 Recording, communication, or information retrieval equipme…         1013
## 5 Equipment for production, distribution, or transformation …          670
## 6 Supports: cabinet structure                                          646

Microsoft

results$year<- as.numeric(as.character(results$year))
df16<-results %>% 
  filter(author =="microsoft") %>% 
  select (description,patent_number) %>% 
  group_by(description) %>% 
  dplyr::summarise(patent_count = n_distinct(patent_number)) %>% 
  arrange(desc(patent_count)) %>% 
  top_n(n=10)

## Selecting by patent_count

head(df16)

## # A tibble: 6 x 2
##   description                                                 patent_count
##   <chr>                                                              <int>
## 1 Miscellaneous                                                       9038
## 2 Data processing: database and file management or data stru…         5769
## 3 Electrical computers and digital processing systems: multi…         5024
## 4 Data processing: presentation processing of document, oper…         4135
## 5 Recording, communication, or information retrieval equipme…         3567
## 6 Computer graphics processing and selective visual display …         2676

Conclusions

Samsung, IBM and Microsoft are innovation leaders among top 10 tech companies selected for the purpose of this project.
Intel and Microsoft have noticeable decrease in thier innovation pace, where as google demonstrate steady increasing innovation pace.
General innovation pace of all 10 companies clearly show “innovation boom” after 1990.
General innovation trend of top 10 tech companies reflects dot-com crisis as well as financial crisis 2007-2008 with significant down in patent registrations during crisis time.
Main innovation categories change over time, but “Active solid state devices” is the leading patent category across all periods. Also the “evolution” of data related leading patent categories is the following: starting from data/information storage before 90th, then data processing: database and file management (1990-2020) to Recording, communication, or information retrieval equipment and image processing (from 2010 till present).
Innovation direction of tech companies reflects the specific of their businesses:
- Apple, Tencent focus on Recording, communication, or information retrieval equipment
- Facebook apart from data processing contribute to information security
- Google is also focued on data processing and contribute to image analysis
- Samsung, IBM and Intel mostly innovate in hardware
- Hon Hai Precision innovate in electrical system and devices and in energy transformation equipment.
- Microsoft and Oracle - digital processing systems and data processing.