Documenting the methods used in R to count patent applications in comments on Blasiak et al 2018 Corporate control and global governance of marine genetic resources. DOI: 10.1126/sciadv.aar5237
The raw data is found in sheet 5 in the supplementary material and can be downloaded here http://advances.sciencemag.org/highwire/filestream/204655/field_highwire_adjunct_files/1/aar5237_datafileS1.xlsx
library(readxl)
rawdata <- readxl::read_excel("/Users/pauloldham17inch/Desktop/aar5237_datafileS1.xlsx", sheet = 5)
Warnings are generated by literal N/A in the first column of the data and can be ignored.
Shows the top of the table. Note that the Journal field contains the number, data and applicant data. Company column is dropped for display.
rawdata %>%
select(id, Organism, Journal) %>%
head()
## # A tibble: 6 x 3
## id Organism Journal
## <dbl> <chr> <chr>
## 1 20589919 methanocaldococcus jannaschii "JOURNAL Patent: WO 2011053699…
## 2 20589916 pyrococcus abyssi "JOURNAL Patent: WO 2011053699…
## 3 20589918 pyrococcus abyssi "JOURNAL Patent: WO 2011053699…
## 4 20589920 pyrococcus abyssi "JOURNAL Patent: WO 2011053699…
## 5 20589917 pyrococcus furiosus "JOURNAL Patent: WO 2011053699…
## 6 20589921 pyrococcus furiosus "JOURNAL Patent: WO 2011053699…
Show the column with application numbers e.g. WO 2011053699-A1
rawdata %>% select(Journal)
## # A tibble: 12,998 x 1
## Journal
## <chr>
## 1 "JOURNAL Patent: WO 2011053699-A1 34 05-MAY-2011;\r\n ABB…
## 2 "JOURNAL Patent: WO 2011053699-A1 2 05-MAY-2011;\r\n ABBO…
## 3 "JOURNAL Patent: WO 2011053699-A1 8 05-MAY-2011;\r\n ABBO…
## 4 "JOURNAL Patent: WO 2011053699-A1 36 05-MAY-2011;\r\n ABB…
## 5 "JOURNAL Patent: WO 2011053699-A1 5 05-MAY-2011;\r\n ABBO…
## 6 "JOURNAL Patent: WO 2011053699-A1 38 05-MAY-2011;\r\n ABB…
## 7 "JOURNAL Patent: WO 2010042510-A1 11 15-APR-2010;\r\n Abb…
## 8 "JOURNAL Patent: WO 2010042510-A1 15 15-APR-2010;\r\n Abb…
## 9 "JOURNAL Patent: WO 2010042510-A1 24 15-APR-2010;\r\n Abb…
## 10 "JOURNAL Patent: WO 2010042510-A1 28 15-APR-2010;\r\n Abb…
## # ... with 12,988 more rows
The Journal column contains the application number and other information such as the applicant name with a variety of separators and spacings. The code below separates out the elements and removes junk elements such as extra white space and numbers that are not part of the application number following the kind code e.g. WO 2011053699-A1 36. The 36 is not part of an application number.
library(tidyverse)
applications <- rawdata %>%
separate(., Journal, into = c("data"), sep = ";", extra = "drop", remove = FALSE) %>%
separate(., data, into = c("front", "number"), sep = ":") %>%
separate(., number, into = c("extra", "code", "id", "junk", "date"), sep = " ") %>%
select(-extra, -junk) %>%
mutate(id = str_trim(id, side = "both")) %>%
unite(application_number, c("code", "id"), sep = "") %>%
mutate(application_number = str_trim(application_number, side = "both")) %>%
mutate(duplicated = duplicated(application_number))
head(applications)
## # A tibble: 6 x 7
## Organism Journal front Company application_num… date duplicated
## <chr> <chr> <chr> <chr> <chr> <chr> <lgl>
## 1 methanoc… "JOURNAL Pa… JOUR… Abbott… WO2011053699-A1 05-M… FALSE
## 2 pyrococc… "JOURNAL Pa… JOUR… Abbott… WO2011053699-A1 05-M… TRUE
## 3 pyrococc… "JOURNAL Pa… JOUR… Abbott… WO2011053699-A1 05-M… TRUE
## 4 pyrococc… "JOURNAL Pa… JOUR… Abbott… WO2011053699-A1 05-M… TRUE
## 5 pyrococc… "JOURNAL Pa… JOUR… Abbott… WO2011053699-A1 05-M… TRUE
## 6 pyrococc… "JOURNAL Pa… JOUR… Abbott… WO2011053699-A1 05-M… TRUE
Count up the duplicated numbers. These normally arise when each sequence is linked with a species or more than one species is present in a document.
applications %>%
count(duplicated)
## # A tibble: 2 x 2
## duplicated n
## <lgl> <int>
## 1 FALSE 999
## 2 TRUE 11999
There are 999 distinct applications in the 12,998 rows.
unique <- applications %>%
filter(duplicated == FALSE) %>%
mutate(date = lubridate::dmy(date)) %>%
mutate(year = lubridate::year(date))
unique_year <- unique %>%
count(year)
mean(unique_year$n)
## [1] 41.625
median(unique_year$n)
## [1] 47.5
A mean of 41 filings per year and median of 47.5.
unique_year %>% knitr::kable(align = "c")
| year | n |
|---|---|
| 1988 | 1 |
| 1991 | 2 |
| 1992 | 1 |
| 1993 | 2 |
| 1998 | 3 |
| 1999 | 12 |
| 2000 | 15 |
| 2001 | 66 |
| 2002 | 67 |
| 2003 | 42 |
| 2004 | 56 |
| 2005 | 66 |
| 2006 | 67 |
| 2007 | 72 |
| 2008 | 68 |
| 2009 | 90 |
| 2010 | 89 |
| 2011 | 53 |
| 2012 | 54 |
| 2013 | 30 |
| 2014 | 37 |
| 2015 | 49 |
| 2016 | 46 |
| 2017 | 11 |
unique_year %>%
ggplot(., aes(x=year, y = n)) +
ylim(0,100) +
geom_point() +
geom_line() +
scale_x_continuous(breaks = scales::pretty_breaks(n = 20)) +
labs(title = "Trends in PCT patent applications for marine genetic sequences in Blasiak et al", x = "application year", y = "applications")
This demonstrates that filings peaked in 2009 and 2010 but the signal is realtively weak and irregular. Note that data for 2016 and 2017 may prove to be incomplete.