Introduction

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com. This document demonstrates reproducibility and records the steps taken to capture, parse and explore the dataset to enable graph analytics.

Executive Summary

The report analyses the authors of patents filed with the EPO by a leading telecommunications provider in order to detect communities within the co-authorship graph. Along the way, the report highlights a number of source data issues and details a method for data cleansing using the graph itself, as a way to self-heal graphs that have similar issues. There are over 1700 patents being analysed, with more than 1600 unique author names listed.

There exists a small group of authors from this company who have a considerably high degree (number of patent co-authors); however, the community as a whole has a high clustering coefficient compared to an equivalent ER model, and this report explains why.

Data Capture

The European Patent Office (EPO) provides an interactive means of obtaining data on a trial basis from https://data.epo.org/expert-services/start.html for mining patent filings.

The data set can be obtained by applying a search using “app= vodafone*” with an extract definition of “Publication Number, Title, Assignee(s), Inventor(s), Priority Date, Priority Country”. This applicant has been selected for personal reasons and represents a manageable data-set; all authors connected to a patent are included.

There has been little to no previous exploration of this data-set using graph theory, primarily due to the difficulties in obtaining the data and for the reasons highlighted below; this, I believe, justifies it being an interesting and original data-set to use.

Parsing Data

The patent “Title” field can contain multiple entries supporting different languages; for this study we retain only the English title (as this language is common to the target audience). The priority date field also contains multiple dates linked to the countries where the patent was filed; for this analysis we preserve the first published date for each record.

library(stringr) # Provides str_extract/str_replace used below
# Read patent data from zipped/csv file
patents <- read.csv( unzip("data/raw/epvoda_csv.zip"), 
                     nrows=-1, stringsAsFactors=FALSE, encoding="UTF8")
# Retain assignees matching VODAF and VODAPH (one incorrectly spelt name in data-set)
patents <- patents[ grepl( "VODAF|VODAPH", patents[,3] ) , ] 
# Preserve 1st/English title
patents[,2] <- str_replace(  str_extract(patents[,2], ".*?\\||.*?$") , " \\|", "")
# Preserve 1st publication date
patents[,5] <- str_replace(  str_extract(patents[,5], ".*?\\||.*?$") , " \\|", "")

There are 1713 unique patents listed for analysis.

There are 1665 distinct author names in the data-set.
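
As a minimal sketch, these counts can be derived from the parsed data as below (the allAuthors vector used in later chunks is assumed to have been built along these lines).

# Sketch: deriving the counts quoted above (assumes the parsed `patents` data frame;
# `allAuthors`, used in later chunks, is assumed to be built in this way)
length(unique(patents[,1]))   # Unique patent publication numbers
allAuthors <- unique(trimws(str_replace(
  unlist(str_extract_all(patents[,4], ".*?\\||.*?$")), " \\|", "")))
length(allAuthors)            # Distinct author names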

Graph Analysis

Data Conversion

The data is re-organised for exploration using graph methods, with the co-author relationships being generated as edges.

edges <- NULL
nodes <- cbind(allAuthors,allAuthors)
for (i in 1:nrow(patents)) {
  # Extract authors
  pAuthors <- str_replace( unlist(str_extract_all( patents[i,4] , ".*?\\||.*?$") ) , " \\|", "") 
  pAuthors <- gsub("^\\s+|\\s+$", "", pAuthors) #Remove leading and trailing spaces
  if (length(pAuthors)!=1) { # To measure co-author relationships we need more than one author
    t2 <- t(combn(pAuthors,2))
    edges <- rbind(edges, t2 )
  }
}

Graph for Gephi

A graph based on the nodes and edges is created for consumption in Gephi; however, analysis of the graph is carried out both programmatically and via the data-mining tool.

library(igraph) # Graph construction and analysis
library(rgexf)  # GEXF export for Gephi
df <- edge.list( as.data.frame(edges) )
write.gexf(nodes=df$nodes, edges=df$edges, output="data/graph/epvoda.gexf") 
## GEXF graph successfully written at:
## C:\development\workspace-r\SNA_Patents_Project\data\graph\epvoda.gexf
write.table(edges,file="data/clean/edges.csv", row.names=FALSE, col.names = FALSE)
# Create simple graph
g <- simplify(graph.data.frame(edges, directed=FALSE))
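
As a quick sanity check, the size of the resulting igraph object can be inspected; these figures are referred to later when generating the ER comparison graph.

# Quick check of graph size (node and edge counts referenced later in the Gephi section)
vcount(g) # Number of authors in the co-author graph (nodes)
ecount(g) # Number of distinct co-author relationships (edges)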

Graph Calculations

Calculate degree centrality for all nodes, plus in-degree and out-degree. I also measure the average shortest path through the graph and identify which author is ‘central’ to the graph, along with their top-5 neighbours. Additional calculations are performed and documented below (diameter, ratio and average shortest-path).

# Calculate degree for all nodes
 degrees = degree(g, mode="all")
# Author with the Highest Degrees
 degrees[which.max(degrees)]
## DE, PASQUALE, ANDREA 
##                   36
# calculate the in and out degrees
 id = degree(g,mode="in")
 od = degree(g,mode="out")
 id[which.max(id)] # Highest in-degree
## DE, PASQUALE, ANDREA 
##                   36
 od[which.max(od)] # Highest out-degree
## DE, PASQUALE, ANDREA 
##                   36
# Calculate betweenness scores (number of shortest paths passing through a node)
 bet = betweenness(g,V(g),directed=FALSE)
 which.max(bet) # Which author has the highest betweenness scores
## FOX, DAVI 
##        23
# And the neighbours of the betweenness node
bet[neighbors( g, which.max(bet) )] [1:5] # Restricted to the first 5 to save space
## PUDNEY, CHRISTOPHER          FOX, DAVID  PUDNEY, CHRISTOPHE 
##           15120.559            9770.135           26626.144 
##          WONG, GAVI           FROST, TI 
##           71601.027           41667.873
length(bet[neighbors( g, which.max(bet) )]) # Total number of Neighbours of 'betweenness node'
## [1] 30
# Shows the highest-betweenness node is not the same node that has the highest degree.

This shows the author Andrea De Pasquale to be the person with the most co-authors (36) on patents, with the histogram below showing the distribution of degrees from the graph. It illustrates that 36 is an extreme value compared to the average degree of 5, whereby the majority of authors are likely to only ever work with five or fewer other authors in their lifetime.

However, our author with the highest betweenness is David Fox (discussed later in the report), and he would represent one of the best authors with whom to start any viral communication or message for patent-related information amongst the inventor community, as many of the shortest communication pathways run through him. This is supported by his higher-than-average co-author score (degree).

hist(degrees, col=rainbow(8), main="Frequency of co-authors",
     xlab="Degree (number of co-authors)", ylab="Frequency")

There exist in the graph a number of islands (groups of three authors who have only ever worked with each other) but also lone authors, i.e. those that are not connected to anyone else (109 in total, with the calculation shown below).

allCoAuthors <- unique(c(edges[,1], edges[,2])) # All authors appearing in at least one co-author edge
loneAuthors <- setdiff(allAuthors,allCoAuthors)
length(loneAuthors) # Number of lone authors (un-connected)
## [1] 109

This may be explained by some patents having individuals from other organisations listed as co-authors, who are thus less likely to appear in this community. However, this segment represents ~14%, so it is accepted that there would be individuals within the company who are filing patents as individuals.

Communities (fast.greedy)

One algorithm that was not discussed on the course is the fast.greedy algorithm for community detection. This algorithm tries to find dense subgraphs by directly optimising the modularity score. The function is provided by the igraph library. It detects 154 ‘communities’ within the data-set, and the chart below illustrates the community size distribution.

fc <- fastgreedy.community(g)
length(sizes(fc)) # The number of dense communities
## [1] 154
colors <- rainbow(max(membership(fc)))
#plot(g,vertex.color=colors[membership(fc)], vertex.label=NA, vertex.size=4,
#     layout=layout.fruchterman.reingold,
#     main="Plot of communities")

Author Communities

The community plot has been produced in Gephi; the igraph plot code is included (commented out) above, but it is less aesthetically pleasing. The chart clearly shows the small communities and how they link to others, visually identifying those groups of people who ‘commonly’ work together on patents.

Algorithm acknowledgement: A. Clauset, M.E.J. Newman, C. Moore

Gephi

Gephi reports an average degree of 5.088 and when measuring Network Diameter (maximum eccentricity) it reports 20, with a Radius of 1 and average path length of 8.758.

In comparison, a generated ER graph yields an average shortest path of 4.760 (almost half of the Vodafone equivalent) but with significantly fewer triangles, i.e. a low clustering coefficient (20 vs 7091 in this data-set). The ER graph was generated with an edge probability of 0.003269916 = 3961 / ((1557 × 1556) / 2).
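
A minimal sketch of how such an ER comparison graph can be generated with igraph, using the node and edge counts quoted above (the exact generation settings behind the reported figures are an assumption):

# Sketch: ER comparison graph (n and m taken from the figures quoted above)
n <- 1557                                   # Connected authors (nodes)
m <- 3961                                   # Co-author relationships (edges)
p <- m / (n * (n - 1) / 2)                  # Edge probability ~0.00327
er <- erdos.renyi.game(n, p, type="gnp", directed=FALSE)
average.path.length(er)                     # Expected ~4.8 vs 8.758 for the author graph
length(triangles(er)) / 3                   # Triangles in the ER graph (~20)
length(triangles(g)) / 3                    # Triangles in the author graph (~7091)
transitivity(er); transitivity(g)           # Global clustering coefficients for comparison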

This suggests a highly collaborative network. This is true, and collaboration carries business and financial benefits for authors that only start to degrade numerically above three authors per patent. This explains the high clustering coefficient (the higher number of triangles/triads) found in this dataset: the reward to patent authors does not reduce when three or fewer authors are involved, and this is the basis for why the graph has a high clustering coefficient.

The MCL (experimental) clustering algorithm detects 12 clusters in the data-set. Initial analysis suggests a strong geographic split, and with further analysis it is expected that the clusters would also relate to topic areas (areas of knowledge). At this point you could identify the influencer in each sub-graph for dispersing inventor communications to each geographically split community (like a regional lead appointed on fact and not popularity, a principle of a data-driven company).
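
As the MCL clustering itself was run in Gephi, the sketch below illustrates the "influencer per sub-graph" idea in R, using the fast.greedy communities computed earlier as a stand-in for the MCL clusters and highest degree as an assumed measure of influence.

# Sketch: highest-degree author within each community (fast.greedy used as a stand-in for MCL)
influencers <- sapply(sort(unique(membership(fc))), function(cm) {
  members <- V(g)[membership(fc) == cm]
  names(which.max(degree(g, v=members)))
})
head(influencers) # One candidate 'regional lead' per community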

The image below shows the workings within the Gephi tool.

Relationships of Patent authors

Another metric that was not explored during the course is the density of a graph, which for the author graph is recorded as 0.003. Density measures how complete a graph is against all possible edges; a value of 0.003 means fewer than 1% of the possible edges exist, so the graph is globally sparse even though it is locally well clustered, which is consistent with the previous measures.
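
For reference, a one-line check of the density figure with igraph (using the graph g built earlier):

# Graph density = edges present / all possible edges (reported as ~0.003 above)
graph.density(g)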

Limitations

Having explored the dataset, it was noticeable that inventors' names can be misspelt and thus be treated as a new author instead of as a repeat sighting of an existing author.

This is apparent in the extreme real example below, where the author's name is “Dr David Andrew Fox” but it appears in several alternative forms (7 in total):-

This could lead to incorrect statements being made about the data-set, as each variant is a node in its own right. The frequency of each is shown below.

# authorFox: frequency of each of the seven name variants (built during data exploration)
barplot(authorFox, col=rainbow(7), cex.axis=0.5, cex.names=0.5, 
        main="Data quality issues for an author's name", ylab="Frequency", las=3)

Although this is an extreme example of the error, fixing it would only improve the underlying connectivity of the graph, so the overall outcome of a highly connected graph remains constant. Fixing this data issue is a complex problem, as graphs are not ideal for this type of operation, but the graph is ideal for identifying and deriving the fix.

An additional idea, not taught on the course, is to use graph methods to cleanse the graph itself (its nodes). The theory is that authors are likely to have co-authors with whom they have worked on several occasions, so examining these relationships provides an efficient approach for comparing and matching variant nodes back to the real author's node and thus de-duplicating alternatively named author nodes (repairing the model). This is based on a minimum of 86% of the community having worked with another author.

The reverse may also be possible, although it is not explored here: multiple individuals with the same name (“John Smith” is a common name in England) could be factored back out instead of being treated as the same person, by examining their connections; if they appear unrelated or distant (not in a triad) then it is unlikely they are the same person, and they should exist as two entries in the data-set. This is particularly important because, as the size of the dataset being examined increases, the chance of two individuals sharing a name increases, and one author could be credited with another's work by accident.

Exploratory work suggests that the function adist, which implements the Levenshtein (edit) distance metric, would be ideal for comparing names and linking together those below a threshold of ~7. I plan to revisit the above method in the future, outside of this assignment.
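
A minimal sketch of this idea, assuming the allAuthors vector from earlier and the ~7 threshold mentioned above (the threshold and the pairing logic are illustrative, not the final method):

# Sketch: candidate duplicate author names via Levenshtein (edit) distance
d <- adist(allAuthors)                           # Pairwise edit distances between names
diag(d) <- NA                                    # Ignore self-comparisons
cand <- which(d < 7, arr.ind=TRUE)               # Candidate pairs below the ~7 threshold
cand <- cand[cand[,1] < cand[,2], , drop=FALSE]  # Keep each pair once
head(cbind(allAuthors[cand[,1]], allAuthors[cand[,2]]))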

Data Quality

Data quality is a serious issue. For example, the initial data discovery detected an assignee name of “VODAPHONE LIMITED”, where an employee or service provider clearly meant to specify VODAFONE. A lazy search for %VODAFONE% would have missed a number of patents (this was identified, captured and included in this exercise).

Data Cleanup

From the above exploratory work, a number of issues exist with inconsistent naming and processing errors impacting authors' names (a sketch of possible cleanup rules follows the list); these are:-

  1. Country of the author may be appended, e.g. “(GB)”
  2. Inclusion of the middle name and/or the initial of the middle name
  3. The last character of the forename or middle name is often missing
  4. The author's company has been included on more than one occasion, e.g. “TURK, JOHNVODAFONE, GROUP, PLC.GROUP, TECHNOLOGY”
  5. Occasionally the “form of address” is included (DR. and/or the engineering degree “DIPL.-ING.”), both at the start and the end of the name field
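
A hedged sketch of cleanup rules for issues 1 and 5 above (the regular expressions are assumptions for illustration, not the exact rules applied in this study):

# Sketch: illustrative cleanup of author names (patterns are assumptions, not the exact rules used)
cleanName <- function(x) {
  x <- gsub("\\s*\\([A-Z]{2}\\)", "", x)           # 1. Strip appended country codes, e.g. "(GB)"
  x <- gsub("^(DR\\.|DIPL\\.-ING\\.)\\s*", "", x)  # 5. Strip forms of address at the start...
  x <- gsub("\\s*(DR\\.|DIPL\\.-ING\\.)$", "", x)  #    ...and at the end of the name field
  trimws(x)                                        # Remove leading/trailing spaces
}
cleanName("DR. FOX, DAVID ANDREW (GB)")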

Conclusion

Identifying the influencers within a graph can be important, especially for passing messages through a company not tied together by an organisation chart but linked by knowledge and skills. A technical message from a respected peer may carry more substance and credibility than a message distributed through other channels.

We have discovered a high clustering coefficient and, using different community algorithms, detected a geographical split as well as those individuals who appear to consistently work together over their lifetime. It was also identified that misinformation would spread quickly in this network, as it is highly collaborative.

All information in this report is publicly available information from the resources cited.