Notes on patent counts for Corporate Control and Global Governance of Marine Genetic Resources

Documenting the methods used in R to count patent applications in comments on Blasiak et al 2018 Corporate control and global governance of marine genetic resources. DOI: 10.1126/sciadv.aar5237

Read in the data.

The raw data is found in sheet 5 in the supplementary material and can be downloaded here http://advances.sciencemag.org/highwire/filestream/204655/field_highwire_adjunct_files/1/aar5237_datafileS1.xlsx

library(readxl)
rawdata <- readxl::read_excel("/Users/pauloldham17inch/Desktop/aar5237_datafileS1.xlsx", sheet = 5)

Warnings are generated by literal N/A in the first column of the data and can be ignored.

View the raw data

Shows the top of the table. Note that the Journal field contains the number, data and applicant data. Company column is dropped for display.

rawdata %>% 
  select(id, Organism, Journal) %>%
  head()

## # A tibble: 6 x 3
##         id Organism                      Journal                          
##      <dbl> <chr>                         <chr>                            
## 1 20589919 methanocaldococcus jannaschii "JOURNAL   Patent: WO 2011053699…
## 2 20589916 pyrococcus abyssi             "JOURNAL   Patent: WO 2011053699…
## 3 20589918 pyrococcus abyssi             "JOURNAL   Patent: WO 2011053699…
## 4 20589920 pyrococcus abyssi             "JOURNAL   Patent: WO 2011053699…
## 5 20589917 pyrococcus furiosus           "JOURNAL   Patent: WO 2011053699…
## 6 20589921 pyrococcus furiosus           "JOURNAL   Patent: WO 2011053699…

Show the column with application numbers e.g. WO 2011053699-A1

rawdata %>% select(Journal)

## # A tibble: 12,998 x 1
##    Journal                                                                
##    <chr>                                                                  
##  1 "JOURNAL   Patent: WO 2011053699-A1 34 05-MAY-2011;\r\n            ABB…
##  2 "JOURNAL   Patent: WO 2011053699-A1 2 05-MAY-2011;\r\n            ABBO…
##  3 "JOURNAL   Patent: WO 2011053699-A1 8 05-MAY-2011;\r\n            ABBO…
##  4 "JOURNAL   Patent: WO 2011053699-A1 36 05-MAY-2011;\r\n            ABB…
##  5 "JOURNAL   Patent: WO 2011053699-A1 5 05-MAY-2011;\r\n            ABBO…
##  6 "JOURNAL   Patent: WO 2011053699-A1 38 05-MAY-2011;\r\n            ABB…
##  7 "JOURNAL   Patent: WO 2010042510-A1 11 15-APR-2010;\r\n            Abb…
##  8 "JOURNAL   Patent: WO 2010042510-A1 15 15-APR-2010;\r\n            Abb…
##  9 "JOURNAL   Patent: WO 2010042510-A1 24 15-APR-2010;\r\n            Abb…
## 10 "JOURNAL   Patent: WO 2010042510-A1 28 15-APR-2010;\r\n            Abb…
## # ... with 12,988 more rows

Separate out the application numbers and identify duplicates

The Journal column contains the application number and other information such as the applicant name with a variety of separators and spacings. The code below separates out the elements and removes junk elements such as extra white space and numbers that are not part of the application number following the kind code e.g. WO 2011053699-A1 36. The 36 is not part of an application number.

library(tidyverse)
applications <- rawdata %>%
  separate(., Journal, into = c("data"), sep = ";", extra = "drop", remove = FALSE) %>% 
  separate(., data, into = c("front", "number"), sep = ":") %>% 
  separate(., number, into = c("extra", "code", "id", "junk", "date"), sep = " ") %>% 
  select(-extra, -junk) %>% 
  mutate(id = str_trim(id, side = "both")) %>% 
  unite(application_number, c("code", "id"), sep = "") %>% 
  mutate(application_number = str_trim(application_number, side = "both")) %>% 
  mutate(duplicated = duplicated(application_number))

head(applications)

## # A tibble: 6 x 7
##   Organism  Journal        front Company application_num… date  duplicated
##   <chr>     <chr>          <chr> <chr>   <chr>            <chr> <lgl>     
## 1 methanoc… "JOURNAL   Pa… JOUR… Abbott… WO2011053699-A1  05-M… FALSE     
## 2 pyrococc… "JOURNAL   Pa… JOUR… Abbott… WO2011053699-A1  05-M… TRUE      
## 3 pyrococc… "JOURNAL   Pa… JOUR… Abbott… WO2011053699-A1  05-M… TRUE      
## 4 pyrococc… "JOURNAL   Pa… JOUR… Abbott… WO2011053699-A1  05-M… TRUE      
## 5 pyrococc… "JOURNAL   Pa… JOUR… Abbott… WO2011053699-A1  05-M… TRUE      
## 6 pyrococc… "JOURNAL   Pa… JOUR… Abbott… WO2011053699-A1  05-M… TRUE

Count up the duplicated numbers. These normally arise when each sequence is linked with a species or more than one species is present in a document.

applications %>%
  count(duplicated)

## # A tibble: 2 x 2
##   duplicated     n
##   <lgl>      <int>
## 1 FALSE        999
## 2 TRUE       11999

There are 999 distinct applications in the 12,998 rows.

Filter the duplicates to identify unique applications

unique <- applications %>% 
  filter(duplicated == FALSE) %>% 
  mutate(date = lubridate::dmy(date)) %>% 
  mutate(year = lubridate::year(date))

Calculate the mean and median

unique_year <- unique %>% 
  count(year)

mean(unique_year$n)

## [1] 41.625

median(unique_year$n)

## [1] 47.5

A mean of 41 filings per year and median of 47.5.

View the count of applications by year

unique_year %>% knitr::kable(align = "c")

year	n
1988	1
1991	2
1992	1
1993	2
1998	3
1999	12
2000	15
2001	66
2002	67
2003	42
2004	56
2005	66
2006	67
2007	72
2008	68
2009	90
2010	89
2011	53
2012	54
2013	30
2014	37
2015	49
2016	46
2017	11

Chart the Results

unique_year %>%
  ggplot(., aes(x=year, y = n)) +
  ylim(0,100) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 20)) +
  labs(title = "Trends in PCT patent applications for marine genetic sequences in Blasiak et al", x = "application year", y = "applications")

This demonstrates that filings peaked in 2009 and 2010 but the signal is realtively weak and irregular. Note that data for 2016 and 2017 may prove to be incomplete.