This is an R Markdown document describing an initial analysis of the patent move sequences coded by Nick Groom for a sample of UK patents from 1734 to 2011, one randomly selected patent from each year. The goal is to track change over time in the discursive structure of British patents primarily using string edit distance. Final methods and results to be reported in paper by Nick Groom and Jack Grieve.
The main package used for this analysis is stringdist, which contains various string matching and string distance functions, as are commonly used in computational linguistics, dialectometry, genetics, etc. Documentation is available at https://cran.r-project.org/web/packages/stringdist/stringdist.pdf. We also use dendextend for dendrogram plotting and knitr for table output.
library(stringdist)
library(dendextend)
library(knitr)
The dataset consists of a string of characters, representing a sequence of rhetorical moves in a single randomly selected patent, for 276 years between 1734 and 2011. We lack data for 1739 and 1758 because there was no patents available for those two years. The underlying dataset also includes codes from 1711 to 1733, but because there are only 4 years with data in that span, we have excluded them from this preliminary analysis.
In these strings each character represents 1 of 28 individual moves as listed in the table below.
gloss <- read.table("gloss.txt", header = TRUE, sep = "\t")
kable(gloss)
| Code | Move |
|---|---|
| a | Statement that monarch has granted the applicant patent protection for their invention |
| b | statement of condition of grant |
| c | Declaration of invention (usually noting that this is in fulfilment of condition of grant) |
| d | Description of invention |
| e | witness statement and signature |
| f | confirmation that specification has been lodged within prescribed time limit |
| g | Salutation |
| h | Statement of petition to monarch (telling the monarch about their invention and/or asking for patent protection for invention) |
| i | Other witness signature(s) |
| j | Drawings |
| k | Declaration that specification is “a full, true, and perfect description of our said Invention” |
| l | Restricting the scope of the invention (in relation to existing technology/practice) |
| m | Statement of claims (e.g. “Having thus described my invention, what I claim as new is …”) |
| n | Royal Arms |
| o | Filing information (date of patent, patent number, etc.) |
| p | Title of patent |
| q | Declaration of grant of patent (stating inventor name, invention name and date of grant and seal) |
| r | Identification of provisional specification (1952-83 includes identifying inventor and where & when specification was lodged; 1884- just title PROVISIONAL SPECIFICATION) |
| s | Provisional declaration of invention |
| t | Brief description of invention |
| u | Indentification of complete specification (in pursuance of condition of grant (so incorporating ‘c’ in pre-1852 specs), identifying inventor and where & when specification was lodged; 1884- just title saying COMPLETE SPECIFICATION; from 1978, just ‘Specification’) |
| v | Description of the drawings (must be clearly separated from main or other part of spec. May have various titles or no title at all but needs to be separate) |
| w | “A communication from [overseas inventor or corporation]” |
| x | Patent number |
| y | Title of document: ‘Patent Specification’ |
| z | Abstract |
| 1 | Abstract title |
| 2 | Title page |
We read in the data, which consists of two columns: a year and string.
patents <- read.table("PATENT_rev.csv", header = TRUE, sep = ",")
patents
There is considerable diversity in the codes, with 74 different sequence types attested.
length(table(patents$CODE))
## [1] 73
With some sequences being repeated up to 23 times.
summary(patents)
## YEAR CODE
## Min. :1734 xyop1zxjdm : 23
## 1st Qu.:1805 gabcdef : 18
## Median :1874 ynoxupcdmej : 17
## Mean :1873 gabcdeif : 15
## 3rd Qu.:1942 gabcdmeifj : 15
## Max. :2011 norpsteupcdmej: 12
## (Other) :176
Overall, the frequency distribution of sequences shows a quick fall, with a only a relatively small number of sequences accounting for a substantial number of tokens. This, however, does not take into account that some of these distinct sequences are fairly similar to each other, which is one reason why we look at string distance later.
codefreq <- as.data.frame(table(patents$CODE))
codefreq <- codefreq[order(codefreq$Freq, decreasing = TRUE), ]
barplot(codefreq$Freq, names.arg = codefreq$Var1, main = "Patent Move Sequences",
ylab = "Freqency", las = 2, cex.names = 0.4, col = "red")
Before comparing strings, it is interesting to look at the frequency of each of the individual codes, regardless of position, which we do here by counting the frequency of each move code across all sequences.
codes <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "1", "2")
movefreq <- data.frame()
for (i in c(1:length(codes))) {
movefreq[i, "CODE"] <- codes[i]
movefreq[i, "COUNT"] <- length(grep(codes[i], patents$CODE))
}
movefreq
We can also plot the frequency distribution, showing a gradual drop in the percentage of patents contain a particular move.
movefreqsort <- movefreq[order(movefreq$COUNT, decreasing = TRUE), ]
barplot((movefreqsort$COUNT/276), names.arg = movefreqsort$CODE, main = "Patent Move",
ylab = "Freqency", cex.names = 0.5, col = "red")
movemat <- data.frame()
for (i in c(1:nrow(patents))) {
movemat[i, "YEAR"] <- patents$YEAR[i]
for (j in c(1:length(codes))) {
movemat[i, codes[j]] <- length(grep(codes[j], patents$CODE[i]))
}
}
movemat
We also look at when the moves were used over time, again ignoring where the moves occur in the patents. Notably, these graphs often show extremely abrupt changes, showing that introduction or loss of specific moves tends not to occur slowly over multiple years. There are some exception, such as Move J (inclusion of drawings), which shows a very gradual increase.
plot(movemat$YEAR, movemat$a, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move A by Year")
plot(movemat$YEAR, movemat$b, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move B by Year")
plot(movemat$YEAR, movemat$c, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move C by Year")
plot(movemat$YEAR, movemat$d, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move D by Year")
plot(movemat$YEAR, movemat$e, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move E by Year")
plot(movemat$YEAR, movemat$f, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move F by Year")
plot(movemat$YEAR, movemat$g, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move G by Year")
plot(movemat$YEAR, movemat$h, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move H by Year")
plot(movemat$YEAR, movemat$i, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move I by Year")
plot(movemat$YEAR, movemat$j, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move J by Year")
plot(movemat$YEAR, movemat$k, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move K by Year")
plot(movemat$YEAR, movemat$l, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move L by Year")
plot(movemat$YEAR, movemat$m, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move M by Year")
plot(movemat$YEAR, movemat$n, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move N by Year")
plot(movemat$YEAR, movemat$o, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move O by Year")
plot(movemat$YEAR, movemat$p, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move P by Year")
plot(movemat$YEAR, movemat$q, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move Q by Year")
plot(movemat$YEAR, movemat$r, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move R by Year")
plot(movemat$YEAR, movemat$s, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move S by Year")
plot(movemat$YEAR, movemat$t, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move T by Year")
plot(movemat$YEAR, movemat$u, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move U by Year")
plot(movemat$YEAR, movemat$v, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move V by Year")
plot(movemat$YEAR, movemat$w, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move W by Year")
plot(movemat$YEAR, movemat$x, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move X by Year")
plot(movemat$YEAR, movemat$y, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move Y by Year")
plot(movemat$YEAR, movemat$z, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move Z by Year")
plot(movemat$YEAR, movemat$`1`, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move 1 by Year")
plot(movemat$YEAR, movemat$`2`, type = "h", col = "red", yaxt = "n", ylab = "",
xlab = "", main = "Occurences of Move 2 by Year")
The first stage of the analysis is to compare string edit distance between all adjacent patent sequences. Basically string edit distance measures the numbers of operations (i.e. insertion, deletion, substitution) needed to transform one string to another.
To do this we use the stringdist() function and the default optimal string alignment metric (OSA), also known as “restricted Damereau-Levenshtein distance”, which seems to work as expected.
as.character(patents$CODE[247])
## [1] "xyop1zxjdm"
as.character(patents$CODE[248])
## [1] "xyop1zxjdm"
stringdist(patents$CODE[247], patents$CODE[248])
## [1] 0
as.character(patents$CODE[249])
## [1] "xyo1zxjdm"
stringdist(patents$CODE[247], patents$CODE[249])
## [1] 1
as.character(patents$CODE[245])
## [1] "xyop1zx2dme"
stringdist(patents$CODE[247], patents$CODE[245])
## [1] 2
as.character(patents$CODE[5])
## [1] "gabcdeif"
stringdist(patents$CODE[247], patents$CODE[5])
## [1] 10
To compare every temporally adjacent pair of move sequences, we loop through the ordered patents data and compare each pair of move sequences using stringdist(), saving the results in a new dataframe, using the year of the second string as an index.
adjacent <- data.frame()
for (i in c(2:nrow(patents))) {
adjacent[i - 1, "YEAR"] <- patents$YEAR[i]
adjacent[i - 1, "DIST"] <- stringdist(patents$CODE[i - 1], patents$CODE[i])
}
adjacent
We can then plot the distances between each adjacent pair on a bar plot, which in this case perhaps most notably shows a lot of variation from 1900-1975, aside from the time around WWII, and relative less variation since 1975.
barplot(adjacent$DIST, names.arg = adjacent$YEAR, main = "Adjacent Year Edit Distance",
ylab = "Edit Distance", col = "red", cex.names = 0.7)
We can also smooth out some of the noise by averaging over 10 year spans, for example, and plotting the result. This graph more clearly shows when the rhetorical structures of patents is least stable.
adjacentavg <- data.frame()
for (i in c(6:(nrow(adjacent) - 4))) {
adjacentavg[i - 5, "YEAR"] <- adjacent$YEAR[i]
adjacentavg[i - 5, "DIST"] <- (adjacent$DIST[i - 5] + adjacent$DIST[i -
4] + adjacent$DIST[i - 3] + adjacent$DIST[i - 2] + adjacent$DIST[i -
1] + adjacent$DIST[i] + adjacent$DIST[i + 1] + adjacent$DIST[i + 2] +
adjacent$DIST[i + 3] + adjacent$DIST[i + 4])/10
}
plot(adjacentavg$YEAR, adjacentavg$DIST, type = "l", col = "red", main = "Adjacent Year Edit Distance",
ylab = "Edit Distance")
We can also look at this another way by summing these distances over time and plotting the results. Overall this shows the same basic pattern, but it makes the trend clearer. From approximately 1750 to 1920 patent structures changes at a fairly consistent rate, but from approximately 1920 to the 1940 it changes at a much faster rate. Patent structure change then plateaus briefly between approximately 1940 and 1950, before speeding up again at a rate similar to the pre-1920 level until approximately 1970. Finally, since approximately 1970 we have seen very little change in patent structure. There are also some other potentially interesting inflection points (e.g. around 1800 and 1850).
cumulative <- data.frame()
cumulative[1, "YEAR"] <- adjacent$YEAR[1]
cumulative[1, "CUMDIST"] <- adjacent$DIST[1]
for (i in c(2:nrow(adjacent))) {
cumulative[i, "YEAR"] <- adjacent$YEAR[i]
cumulative[i, "CUMDIST"] <- adjacent$DIST[i] + cumulative$CUMDIST[i - 1]
}
plot(cumulative$YEAR, cumulative$CUMDIST, type = "l", col = "red", main = "Cumulative Adjacent Year Edit Distance Change over Time",
ylab = "Cumulative Edit Distance", xlab = "year")
Rather than just looking at the distance between the move sequences for patents from adjacent years, we can also look at string edit distance between all years and then use this information to cluster patents by year in various ways.
First, we make a distance matrix of string edit distances using the stringdistmatrix() function.
distmat <- as.dist(stringdistmatrix(patents$CODE, patents$CODE))
Then we can run a simple metric multidimensional scaling to dimension reduce this matrix containing the distance between all pairs of strings down to two dimensions.
fit <- cmdscale(distmat, k = 2)
The years can then be plotted along these two dimensions to visualise year clusters in the data. The results are a bit hard to see because of all the overlap, but there are two big clusters, one on the middle-left consisting primarily of early patents and one on the top-middle consisting primarily of later patents. The mid-to-late-19th century patents appear to be less consistent, being primarily spread out in smaller across the bottom of the graph.
x <- fit[, 1]
y <- fit[, 2]
plot(x, y, xlab = "Dimension 1", ylab = "Dimension 2", main = "MDS Plot of Patent Sequence Distances by Year",
type = "n")
text(x, y, labels = patents$YEAR, cex = 0.5, col = "red")
We also analyse the distance matrix by applying a hierarchical cluster analysis using Ward’s method, pulling out 5 main clusters (based on the dendrogram below) which we can see, when plotted, largely gives temporal clusters, although there is some interesting vacillation in the early 20th century, which corresponds roughly to a time of faster change (see above), does seem different that the other changes (and more inline with gradual change). But we should be careful here: hierarchical clustering is forcing hard clusters.
dendro <- as.dendrogram(hclust(distmat, method = "ward.D2"))
clusters <- cutree(dendro, k = 5)
clusters <- gsub("3", "X", clusters)
clusters <- gsub("4", "3", clusters)
clusters <- gsub("X", "4", clusters)
clusters
## [1] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [18] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [35] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [52] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [69] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [86] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [103] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "2" "2" "2"
## [120] "2" "2" "2" "4" "2" "2" "2" "2" "2" "2" "2" "2" "2" "2" "2" "2" "2"
## [137] "2" "2" "2" "2" "2" "2" "2" "2" "2" "2" "2" "2" "3" "3" "3" "3" "3"
## [154] "3" "3" "4" "3" "3" "3" "3" "3" "3" "3" "3" "4" "4" "4" "4" "4" "3"
## [171] "4" "4" "4" "3" "3" "4" "4" "4" "3" "4" "4" "3" "4" "3" "4" "3" "4"
## [188] "4" "3" "4" "3" "4" "4" "3" "4" "3" "4" "3" "4" "3" "4" "4" "4" "4"
## [205] "4" "4" "4" "4" "4" "4" "3" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4"
## [222] "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4"
## [239] "4" "4" "4" "4" "5" "5" "5" "5" "5" "5" "5" "5" "5" "5" "5" "5" "5"
## [256] "5" "5" "5" "5" "5" "5" "5" "5" "5" "5" "5" "5" "5" "5" "5" "5" "5"
## [273] "5" "5" "5" "5"
plot(patents$YEAR, clusters, type = "l", col = "red", main = "Cluster Membership by Year",
ylab = "Cluster", xlab = "Year")
Finally we can then plot the dendrogram, which shows 5 main clusters.
dend_col = color_branches(dendro, k = 5)
par(cex = 0.3)
plot(dend_col)
Overall there seems to be two main results:
Individual patent moves usually appear and disappear very abruptly: they do not gradually become popular; they occur suddenly then usually occur consistently for some stretch before stopping suddenly. In other words, they do not usually appear or disappear gradually over multiple years. Similarly, in a small number of cases, a new move is just used once. There are also a small number of counter examples (e.g. inclusion of drawings), but these are the exceptions to the rule.
Patents (i.e. sequences of patent moves) usually change very gradually, even though individual moves change abruptly. Change in patent move structure does sometime plateau and the rate of change does sometimes change, but overall the change is clearly gradual. So therefore it would seem that despite the abruptness, the individual moves fall in and out of usage at different points in time, leading to gradual overall change.
So in general, it would appear that these results support a gradual evolution theory of patent move structure evolution, but also that somewhat counter-intuitively this gradual overall evolution is driven by very sudden changes at the individual move level.
That connection is surprising and doesn’t like it should necessarily be the case: a bunch of gradually change in the use of individual move usage would presumably result in a gradual overall change as well.
What is remarkable is that given these abrupt changes on the individual move level do not result (pretty much ever) in abrupt changes on the overall move sequence level. It seems like there are external pressures here (e.g. functional pressure to have patents change in response to technological and legal changes while still keeping patents comprehensible and consistent) to make abrupt changes but not make too many abrupt changes at once.
The main limitation right now is that we are only looking at one patent per year. Specifically, would the abrupt individual move changes hold if we had multiple patents per year? We’d probably see something more gradual within years, but the consistency of the result over so many different moves right now suggests nothing major would change.
In addition to including more patents per year, which would lead to some fairly major changes in the analysis presented in this document. there are lots of additional analysis that can be done:
Look at how the position of individual moves changes over time. Right now the individual move analysis ignores position. So it would be interesting to see, for example, if the position of moves tend to change early or late in their life cycles.
Do a multiple correspondence analysis of the binary move matrix. That would give us back a result a lot like a Biber-style MDA analysis – a series of Dimensions identifying moves that tend to occur together (or tend not to occur together) in the same patent (ignoring order). We could then plot these over time.
Look at n-grams of move sequences and repeat the analyses in Section 4.2, as well as multiple correspondence analysis described above.
The multivariate analysis can be refined in various ways, including the visualisations and interpretation. The clusters should be plotted onto the MDS graph, and the clustering of years in both analyses should be explored. The cluster analysis could also be done in some way that guarantees that clusters don’t overlap temporally (perhaps see the work by Gries and Hilpert)?