Documenting the methods used in R to count patent applications in comments on Blasiak et al 2018 Corporate control and global governance of marine genetic resources. DOI: 10.1126/sciadv.aar5237

Read in the data.

The raw data is found in sheet 5 in the supplementary material and can be downloaded here http://advances.sciencemag.org/highwire/filestream/204655/field_highwire_adjunct_files/1/aar5237_datafileS1.xlsx

library(readxl)
rawdata <- readxl::read_excel("/Users/pauloldham17inch/Desktop/aar5237_datafileS1.xlsx", sheet = 5)

Warnings are generated by literal N/A in the first column of the data and can be ignored.

View the raw data

Shows the top of the table. Note that the Journal field contains the number, data and applicant data. Company column is dropped for display.

rawdata %>% 
  select(id, Organism, Journal) %>%
  head()
## # A tibble: 6 x 3
##         id Organism                      Journal                          
##      <dbl> <chr>                         <chr>                            
## 1 20589919 methanocaldococcus jannaschii "JOURNAL   Patent: WO 2011053699…
## 2 20589916 pyrococcus abyssi             "JOURNAL   Patent: WO 2011053699…
## 3 20589918 pyrococcus abyssi             "JOURNAL   Patent: WO 2011053699…
## 4 20589920 pyrococcus abyssi             "JOURNAL   Patent: WO 2011053699…
## 5 20589917 pyrococcus furiosus           "JOURNAL   Patent: WO 2011053699…
## 6 20589921 pyrococcus furiosus           "JOURNAL   Patent: WO 2011053699…

Show the column with application numbers e.g. WO 2011053699-A1

rawdata %>% select(Journal)
## # A tibble: 12,998 x 1
##    Journal                                                                
##    <chr>                                                                  
##  1 "JOURNAL   Patent: WO 2011053699-A1 34 05-MAY-2011;\r\n            ABB…
##  2 "JOURNAL   Patent: WO 2011053699-A1 2 05-MAY-2011;\r\n            ABBO…
##  3 "JOURNAL   Patent: WO 2011053699-A1 8 05-MAY-2011;\r\n            ABBO…
##  4 "JOURNAL   Patent: WO 2011053699-A1 36 05-MAY-2011;\r\n            ABB…
##  5 "JOURNAL   Patent: WO 2011053699-A1 5 05-MAY-2011;\r\n            ABBO…
##  6 "JOURNAL   Patent: WO 2011053699-A1 38 05-MAY-2011;\r\n            ABB…
##  7 "JOURNAL   Patent: WO 2010042510-A1 11 15-APR-2010;\r\n            Abb…
##  8 "JOURNAL   Patent: WO 2010042510-A1 15 15-APR-2010;\r\n            Abb…
##  9 "JOURNAL   Patent: WO 2010042510-A1 24 15-APR-2010;\r\n            Abb…
## 10 "JOURNAL   Patent: WO 2010042510-A1 28 15-APR-2010;\r\n            Abb…
## # ... with 12,988 more rows

Separate out the application numbers and identify duplicates

The Journal column contains the application number and other information such as the applicant name with a variety of separators and spacings. The code below separates out the elements and removes junk elements such as extra white space and numbers that are not part of the application number following the kind code e.g. WO 2011053699-A1 36. The 36 is not part of an application number.

library(tidyverse)
applications <- rawdata %>%
  separate(., Journal, into = c("data"), sep = ";", extra = "drop", remove = FALSE) %>% 
  separate(., data, into = c("front", "number"), sep = ":") %>% 
  separate(., number, into = c("extra", "code", "id", "junk", "date"), sep = " ") %>% 
  select(-extra, -junk) %>% 
  mutate(id = str_trim(id, side = "both")) %>% 
  unite(application_number, c("code", "id"), sep = "") %>% 
  mutate(application_number = str_trim(application_number, side = "both")) %>% 
  mutate(duplicated = duplicated(application_number))
head(applications)
## # A tibble: 6 x 7
##   Organism  Journal        front Company application_num… date  duplicated
##   <chr>     <chr>          <chr> <chr>   <chr>            <chr> <lgl>     
## 1 methanoc… "JOURNAL   Pa… JOUR… Abbott… WO2011053699-A1  05-M… FALSE     
## 2 pyrococc… "JOURNAL   Pa… JOUR… Abbott… WO2011053699-A1  05-M… TRUE      
## 3 pyrococc… "JOURNAL   Pa… JOUR… Abbott… WO2011053699-A1  05-M… TRUE      
## 4 pyrococc… "JOURNAL   Pa… JOUR… Abbott… WO2011053699-A1  05-M… TRUE      
## 5 pyrococc… "JOURNAL   Pa… JOUR… Abbott… WO2011053699-A1  05-M… TRUE      
## 6 pyrococc… "JOURNAL   Pa… JOUR… Abbott… WO2011053699-A1  05-M… TRUE

Count up the duplicated numbers. These normally arise when each sequence is linked with a species or more than one species is present in a document.

applications %>%
  count(duplicated)
## # A tibble: 2 x 2
##   duplicated     n
##   <lgl>      <int>
## 1 FALSE        999
## 2 TRUE       11999

There are 999 distinct applications in the 12,998 rows.

Filter the duplicates to identify unique applications

unique <- applications %>% 
  filter(duplicated == FALSE) %>% 
  mutate(date = lubridate::dmy(date)) %>% 
  mutate(year = lubridate::year(date))

Calculate the mean and median

unique_year <- unique %>% 
  count(year)

mean(unique_year$n)
## [1] 41.625
median(unique_year$n)
## [1] 47.5

A mean of 41 filings per year and median of 47.5.

View the count of applications by year

unique_year %>% knitr::kable(align = "c")
year n
1988 1
1991 2
1992 1
1993 2
1998 3
1999 12
2000 15
2001 66
2002 67
2003 42
2004 56
2005 66
2006 67
2007 72
2008 68
2009 90
2010 89
2011 53
2012 54
2013 30
2014 37
2015 49
2016 46
2017 11

Chart the Results

unique_year %>%
  ggplot(., aes(x=year, y = n)) +
  ylim(0,100) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 20)) +
  labs(title = "Trends in PCT patent applications for marine genetic sequences in Blasiak et al", x = "application year", y = "applications")

This demonstrates that filings peaked in 2009 and 2010 but the signal is realtively weak and irregular. Note that data for 2016 and 2017 may prove to be incomplete.