UK Company Index for Access and Benefit-Sharing
Paul Oldham
Simon Industrial Fellow
Manchester Institute for Innovation Research
Alliance Manchester Business School
Background
The Question: How to independently identify companies and organisations utilising genetic resources (and associated traditional knowledge) in the UK?
Our Solution: Build an index of companies/organisations who have filed a patent application that:
- contains a reference to genetic material (a species name), and;
- contains a UK applicant or inventor.
We could do this because we had already indexed 14 million patent documents (1975-2013) for species names as reported earlier in a PLOS ONE article.
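The two inclusion criteria above can be sketched as a simple filter. This is a minimal illustration, not the pipeline used for the index: the `PatentRecord` fields, the `is_candidate` and `build_index` names, and the use of "GB" as the country code for the UK are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PatentRecord:
    """Hypothetical minimal patent record; field names are illustrative only."""
    publication_number: str
    applicants: List[str] = field(default_factory=list)
    species_names: List[str] = field(default_factory=list)
    applicant_countries: List[str] = field(default_factory=list)
    inventor_countries: List[str] = field(default_factory=list)


def is_candidate(record: PatentRecord, country: str = "GB") -> bool:
    """True if the record names at least one species AND has a UK applicant or inventor."""
    has_species = len(record.species_names) > 0
    has_uk_party = (country in record.applicant_countries
                    or country in record.inventor_countries)
    return has_species and has_uk_party


def build_index(records: List[PatentRecord], country: str = "GB") -> Dict[str, List[str]]:
    """Map each applicant name to the publication numbers of its qualifying documents."""
    index: Dict[str, List[str]] = {}
    for record in records:
        if is_candidate(record, country):
            for name in record.applicants:
                index.setdefault(name, []).append(record.publication_number)
    return index
```

Note that a filter of this kind keeps every applicant on a qualifying document, including non-UK co-applicants when only the inventor is in the UK, so the resulting list still needs review.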
Why Patent Data?
- Patent applications are an indicator of investments in R&D (and among the best indicators there are) BUT
- Working with patent data at the level of millions requires sophisticated programming and text mining skills;
- Companies may refer to a species either because it is utilised or because it is a target for an invention (e.g. to deal with a pathogen or pest);
- There is a lag time of 2 years between submission of an application and its publication;
- Mapping patent applicant names into other data sources (company registers etc.) is hard but improving.
Methods
We extracted the patent applicant names from the patent documents and then performed a lookup using a combination of:
- UK Companies House data
- Open Corporates
- Manual and automated web searching
Our aim was to obtain registration numbers, postal addresses, telephone numbers, email addresses for the entities so that they could be contacted by DEFRA and/or the UK regulator.
All data was cross-checked but proved to be very messy to work with.
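The lookup step can be illustrated with a small name-matching sketch. It assumes the register has already been loaded into a plain dictionary of registered name to registration number; it does not call the Companies House or Open Corporates services, and the suffix table and function names are hypothetical. Exact matching on a normalised key is only one possible approach and will miss many of the messy cases described here.

```python
import re
from typing import Dict, List, Tuple

# Legal-form words treated as equivalent when matching (illustrative, not exhaustive).
LEGAL_SUFFIXES = {"limited": "ltd", "ltd": "ltd", "plc": "plc", "llp": "llp"}


def normalise_name(name: str) -> str:
    """Build a matching key: lower case, punctuation removed, legal forms unified."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = [LEGAL_SUFFIXES.get(token, token) for token in cleaned.split()]
    return " ".join(tokens)


def lookup_applicants(applicants: List[str],
                      register: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Return (applicant -> registration number, applicants with no exact match).

    `register` maps a registered company name to its registration number.
    """
    by_key = {normalise_name(name): number for name, number in register.items()}
    matched: Dict[str, str] = {}
    unmatched: List[str] = []
    for applicant in applicants:
        key = normalise_name(applicant)
        if key in by_key:
            matched[applicant] = by_key[key]
        else:
            unmatched.append(applicant)  # left for manual or web-based checking
    return matched, unmatched
```

The unmatched list is the part that ends up needing manual and web searching.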
Outcomes
- A main table (Excel sheet) and supplementary tables;
- A simple online map;
- 600 (of 1525) companies and organisations emailed for the UK consultation on ratification;
- An interactive online app that allows companies to be looked up.
Data Issues
We had aimed to automate the entire process but the data was very MESSY and required extensive cross-checking by a combined team of 5 people:
- Names are messy (really messy);
- Companies merge, demerge, cease to operate… (partly identifiable in company registers);
- Companies/organisations may have multiple legal personalities;
- Standard Industrial Classification (SIC) codes used by UK Companies House and for national statistics proved useless as a means of exploring the data (e.g. the headquarters of pharmaceutical companies are typically listed as 70100 - Activities of head offices, not as pharmaceuticals).
Sector Issues
Identifying sectors proved time consuming because the SIC data was so poor. We adopted an approach of:
- Manually reviewing websites to identify what entities said they did;
- Identifying single phrases for major sectors (e.g. Agriculture, Pharmaceuticals etc.);
- Editing for noise (false positives) from our species name matching algorithm using a patent traceback table;
- Recognising the need to link sector categories to wider statistical categories (perhaps at a later stage).
At the end of all of this… we all needed a cocktail on a nice beach somewhere.
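The idea of a single phrase per major sector can be sketched as a keyword lookup over the text that an entity uses to describe itself. The sector labels Agriculture and Pharmaceuticals come from the list above; the keyword lists and the `assign_sector` function are assumptions for illustration, and in practice the sectors were assigned by manual review rather than by a rule like this.

```python
from typing import Dict, List, Optional

# Illustrative keyword lists; a real list would come out of the manual website review.
SECTOR_KEYWORDS: Dict[str, List[str]] = {
    "Pharmaceuticals": ["pharmaceutical", "drug discovery", "therapeutics"],
    "Agriculture": ["agriculture", "crop", "seed"],
}


def assign_sector(description: str,
                  keywords: Dict[str, List[str]] = SECTOR_KEYWORDS) -> Optional[str]:
    """Return the first sector whose keywords appear in the description, else None."""
    text = description.lower()
    for sector, terms in keywords.items():
        if any(term in text for term in terms):
            return sector
    return None  # no keyword hit: leave the entity for manual review
```

Plain substring matching is crude and order dependent, so anything it returns is a suggestion to check, and a `None` result is a prompt to look at the website.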
Patent Data
- Applications linked to countries and persons from the EU are accessible through PATSTAT (the EPO Worldwide Patent Statistical Database), which EUROSTAT was involved in establishing;
- Text mining required to update species data after 2013;
- Elaboration of statistical trends in patent activity for the EU and broken down by member state.
Company Organisation Data
- Look at what the OECD has already done on harmonising patent and company names (to be looked up);
- Very human resource intensive to tidy;
- EU languages and name forms will require a concentrated effort (maybe led by research organisations in a member state as part of a consortium);
- Once a list is established a need for maintenance and update;
- Approach should meet the practical needs of focal points/regulators;
- Approach should facilitate reporting under NP.
Declarations of Due Diligence
Results
1,525 legal entities in x sectors.
#ggplot of companies by sector goes here
Legal Personalities
# companies with the same parent by code
Lessons Learned
This task is much more difficult than it originally appears!!!
- Information gathering can be automated but information processing requires human resources and is time consuming;
- Maintaining or updating an index will be labour intensive but could be done once a year;
- A patent approach focuses on what people do (R&D investments) but may not capture sectors that do not actively pursue patents or that rely on trade secrecy or simple secrecy (illicit trade);
- Indexing should not be seen as a tool for persecuting companies. It casts light on the role of genetic resources in national economies, the diversity of innovative activity involving biodiversity, and is a tool for engagement between regulators and companies/organisations.
Data Cleaning
- The original index contained company names in a mix of all capitals and camel case. These were regularised to lower case and then to proper case. Names were reviewed and, where it was clear that a company name was an abbreviation (e.g. Ici for ICI), the capitals were restored.
- The term Limited may form part of a formal company name and therefore appears as Limited or Ltd.
- Full stops as part of a name were left as is. All others were removed.
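The case rules above can be sketched as follows. This is a minimal sketch, assuming the reviewed abbreviations are held in a small lookup table; `KNOWN_ABBREVIATIONS` and `proper_case` are hypothetical names. Full-stop handling is deliberately left out, because deciding whether a full stop is part of a name needed human judgement.

```python
from typing import Dict

# Abbreviations confirmed during review, keyed by their lower-case form.
# "ici" -> "ICI" is the example from the text; a real table would be longer.
KNOWN_ABBREVIATIONS: Dict[str, str] = {"ici": "ICI"}


def proper_case(name: str,
                abbreviations: Dict[str, str] = KNOWN_ABBREVIATIONS) -> str:
    """Lower-case the name, capitalise each word, then restore known abbreviations."""
    words = name.lower().split()
    return " ".join(abbreviations.get(word, word.capitalize()) for word in words)
```

Note that `str.capitalize` lower-cases everything after the first letter, so names with internal capitals or apostrophes still need a manual pass.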
Noise Issues
Why is Skype in there?