6 Main Move Sequence Analysis
6.1 String Edit Distance
To look at change in discourse move structure over time in more detail, we use string edit distance to quantify the difference between the move strings. Edit distance measures the number of operations (insertion, deletion, substitution) needed to transform one string into another.
To do this we use the stringdist() function and the default optimal string alignment metric (OSA), also known as “restricted Damerau-Levenshtein distance”, which additionally counts the transposition of two adjacent characters as a single operation. It works as one would expect:
as.character(patents$CODE[247])
## [1] "SLMUTLJEK"
as.character(patents$CODE[248])
## [1] "SLMUTLJEK"
stringdist(patents$CODE[247], patents$CODE[248])
## [1] 0
as.character(patents$CODE[245])
## [1] "SLUTLJEK"
stringdist(patents$CODE[247], patents$CODE[245])
## [1] 1
as.character(patents$CODE[5])
## [1] "ABCDEFG"
stringdist(patents$CODE[247], patents$CODE[5])
## [1] 9
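To make the “restricted” part of the metric concrete, here is a minimal base-R sketch of optimal string alignment, written only for illustration. The function name osa_distance() is ours and is not part of the stringdist package, and the swapped string "SLUMTLJEK" is a made-up example rather than a sequence from the data. The sketch assumes that OSA differs from plain Levenshtein distance only in counting a swap of two adjacent characters as one edit; base R's adist() supplies the Levenshtein value for comparison.

```r
# Illustrative OSA (restricted Damerau-Levenshtein) distance in base R.
# osa_distance() is a hypothetical helper, not the stringdist implementation.
osa_distance <- function(a, b) {
  s <- strsplit(a, "")[[1]]
  t <- strsplit(b, "")[[1]]
  n <- length(s)
  m <- length(t)
  # d[i + 1, j + 1] = distance between the first i chars of a and the first j of b
  d <- matrix(0L, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (s[i] == t[j]) 0L else 1L
      best <- min(d[i, j + 1] + 1L,  # deletion
                  d[i + 1, j] + 1L,  # insertion
                  d[i, j] + cost)    # substitution (or match)
      if (i > 1 && j > 1 && s[i] == t[j - 1] && s[i - 1] == t[j]) {
        best <- min(best, d[i - 1, j - 1] + 1L)  # adjacent transposition
      }
      d[i + 1, j + 1] <- best
    }
  }
  d[n + 1, m + 1]
}

osa_distance("SLMUTLJEK", "SLUTLJEK")          # 1: one deletion
osa_distance("SLMUTLJEK", "SLUMTLJEK")         # 1: M and U swapped
as.numeric(adist("SLMUTLJEK", "SLUMTLJEK"))    # 2: Levenshtein needs two edits
```

For move strings this means that two neighbouring moves trading places counts as a single change under OSA, rather than two.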
6.2 Adjacent String Edit Distance
For every adjacent pair of move sequences in the time series, we compute the edit distance between the two strings, using the year of the second string as the index.
adjacent <- data.frame()
for (i in c(2:nrow(patents))) {
adjacent[i - 1, "YEAR"] <- patents$YEAR[i]
adjacent[i - 1, "DIST"] <- stringdist(patents$CODE[i - 1], patents$CODE[i])
}
adjacent
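The same pairing can be sketched without a loop by lining up the vector of codes against a copy of itself shifted by one position. The example below uses a small made-up data frame (toy, with invented years and codes) and base R's adist(), which computes plain Levenshtein distance rather than OSA, so it is a stand-in for stringdist() rather than an exact replacement.

```r
# Hypothetical toy data; the years and codes are invented for illustration.
toy <- data.frame(
  YEAR = 1901:1905,
  CODE = c("SLMUTLJEK", "SLMUTLJEK", "SLUTLJEK", "SLUTJEK", "SLMUTLJEK"),
  stringsAsFactors = FALSE
)

n <- nrow(toy)
# Pair each code with its successor: rows 1..(n-1) against rows 2..n.
adjacent_toy <- data.frame(
  YEAR = toy$YEAR[-1],  # year of the second string in each pair
  DIST = mapply(function(a, b) as.numeric(adist(a, b)),
                toy$CODE[-n], toy$CODE[-1], USE.NAMES = FALSE)
)
adjacent_toy$DIST  # 0 1 1 2
```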
We can then plot the distances between adjacent pairs. Most notably, the plot shows a lot of variation from 1900-1975, aside from the period around WWII, and relatively less variation since 1975.
barplot(adjacent$DIST, names.arg = adjacent$YEAR, main = "Adjacent Year Edit Distance",
ylab = "Edit Distance", col = "seagreen3", border = NA, cex.names = 0.7,
space = 0)
We can also smooth the time series and plot that. This helps us see patterns in the data more clearly and also compensates to some extent for the fact that we are only comparing adjacent years, effectively opening up the comparison window.
Here is a 10-year moving average.
smth <- ma(as.ts(adjacent$DIST), 10, centre = TRUE)
plot(adjacent$YEAR, smth, type = "l", col = "seagreen3", main = "Adjacent Year Edit Distance (Smoothed)",
ylab = "Edit Distance", xlab = "Year")
And here is a 25-year moving average.
smth <- ma(as.ts(adjacent$DIST), 25, centre = TRUE)
plot(adjacent$YEAR, smth, type = "l", col = "seagreen3", main = "Adjacent Year Edit Distance (Smoothed)",
ylab = "Edit Distance", xlab = "Year")
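For readers who want to see what the smoother is doing, here is a base-R sketch of a centred moving average using stats::filter(). It rests on the assumption, which is our understanding of the forecast package rather than something checked here, that ma() with an even order and centre = TRUE gives half weight to the two end points of an (order + 1)-wide window. The helper name centred_ma() is ours.

```r
# Illustrative centred moving average; centred_ma() is a hypothetical helper.
centred_ma <- function(x, order) {
  if (order %% 2 == 0) {
    # Even order: window of order + 1 points, end points weighted by one half
    w <- c(0.5, rep(1, order - 1), 0.5) / order
  } else {
    # Odd order: plain symmetric window
    w <- rep(1, order) / order
  }
  as.numeric(stats::filter(x, w, sides = 2))
}

x <- 1:20
sm <- centred_ma(x, 10)
sm[1:5]    # NA: there is no full window at the start of the series
sm[6:15]   # a straight line is returned unchanged in the interior
```

The NA values at both ends are why the smoothed curves cover fewer years than the raw bar plot, and the gap grows with the window (roughly 12 years at each end for the 25-year average).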
An alternative way to visualise the rate of change of patent structure, which also helps to smooth over some of the short-term variability, is to sum these distances over time and plot the results. This yields a cumulative (monotonic) time series in which the steeper the slope of the line connecting two time points, the greater the amount of change.
cumulative <- data.frame()
cumulative[1, "YEAR"] <- adjacent$YEAR[1]
cumulative[1, "CUMDIST"] <- adjacent$DIST[1]
for (i in c(2:nrow(adjacent))) {
cumulative[i, "YEAR"] <- adjacent$YEAR[i]
cumulative[i, "CUMDIST"] <- adjacent$DIST[i] + cumulative$CUMDIST[i - 1]
}
plot(cumulative$YEAR, cumulative$CUMDIST, type = "l", col = "seagreen3", main = "Cumulative Adjacent Year Edit Distance Change over Time",
ylab = "Cumulative Edit Distance", xlab = "Year")
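The running total built in the loop above is what base R's cumsum() computes. The sketch below checks this on a short made-up vector of distances (dist_toy is invented for illustration) and shows that the step between consecutive cumulative values recovers the original distance, which is why slope can be read as amount of change.

```r
# Hypothetical adjacent-year distances, invented for illustration.
dist_toy <- c(0, 1, 1, 2, 0, 3)

# Explicit running total, mirroring the loop in the text
cum_loop <- numeric(length(dist_toy))
cum_loop[1] <- dist_toy[1]
for (i in 2:length(dist_toy)) {
  cum_loop[i] <- dist_toy[i] + cum_loop[i - 1]
}

cum_vec <- cumsum(dist_toy)   # 0 1 2 4 4 7
identical(cum_loop, cum_vec)  # TRUE

# The step between neighbouring points is the original distance
diff(c(0, cum_vec))           # 0 1 1 2 0 3
```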
It’s also informative to zoom in on this time series a bit to see the detail better.
plot(cumulative$YEAR[c(1:60)], cumulative$CUMDIST[c(1:60)], type = "l", col = "seagreen3",
lwd = 3, main = "Cumulative Adjacent Year Edit Distance Change over Time 1740-1800",
ylab = "Cumulative Edit Distance", xlab = "Year")
plot(cumulative$YEAR[c(60:160)], cumulative$CUMDIST[c(60:160)], type = "l",
col = "seagreen3", lwd = 3, main = "Cumulative Adjacent Year Edit Distance Change over Time 1800-1900",
ylab = "Cumulative Edit Distance", xlab = "Year")
plot(cumulative$YEAR[c(140:271)], cumulative$CUMDIST[c(140:271)], type = "l",
col = "seagreen3", lwd = 3, main = "Cumulative Adjacent Year Edit Distance Change over Time 1880-2011",
ylab = "Cumulative Edit Distance", xlab = "Year")
6.2.1 Summary
This analysis of adjacent move sequence distance also tells us various things about how patents change over time:
There is a lot of variability, and there is almost always some change from year to year.
But the degree of variability is relatively unstable and doesn’t seem to follow any general trend, suggesting a lot of external factors are at play.
In general, the variability in moves decreases moderately although somewhat inconsistently from 1750-1900, but it really increases from around 1900 until the 1940s, at which point it drops sharply, with another sharp drop and subsequently especially low variability after around 1975.
Secondary prominent inflection points include fairly big drops around 1775, 1810, and 1860, with fairly big jumps around 1760, 1800, and 1850.
In general it appears that a big jump in variability will be followed within a decade or so by a fairly big fall, suggesting that although the structure is always in flux, big changes come in bursts, and then things settle down for a bit.
Looking at the cumulative graphs in particular, we see a fairly stable average rate of change up until about 1900, after which it really picks up until almost 1950.
Nevertheless, when we zoom in we can see that this gradual-burst-gradual-burst pattern repeats itself on a smaller scale (fractal-like), with the slower change periods generally being longer than the faster change periods.
Overall, these results are consistent with an s-curve theory of language change, or maybe more specifically with a theory of s-curves being composed of smaller s-curves, which I believe has been discussed somewhere in the literature (Janet Holmes?).
In terms of how patent structure changes, these results suggest it is a continuous and often gradual process, punctuated by quick bursts of change at certain points, which are often followed by periods of relative calm.
6.3 All String Edit Distances
Rather than just comparing adjacent strings, which is a bit artificial and limiting, especially because competing forms often alternate over a decade or more, we also looked at the edit distance between all pairs of move sequences.
6.3.1 String Distance Matrix
First, we make a distance matrix of string edit distances using the stringdistmatrix() function.
distmat <- as.dist(stringdistmatrix(patents$CODE, patents$CODE))
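As a small illustration of what this matrix holds, the sketch below builds the same kind of object for four made-up codes (codes_toy is invented for illustration). It uses base R's adist(), which is plain Levenshtein distance rather than OSA, so it is a stand-in for stringdistmatrix(). As far as we know, calling stringdistmatrix() with a single vector also returns a dist object directly, which would make the as.dist() wrapper unnecessary.

```r
# Hypothetical move sequences, invented for illustration.
codes_toy <- c("SLMUTLJEK", "SLUTLJEK", "SLUTJEK", "ABCDEFG")

# adist() returns the full square matrix; as.dist() keeps the lower triangle
d_toy <- as.dist(adist(codes_toy))
m_toy <- as.matrix(d_toy)

m_toy[1, 2]        # 1: one deletion apart
m_toy[1, 3]        # 2
m_toy[1, 4]        # 9: almost nothing in common
isSymmetric(m_toy) # TRUE: distance does not depend on direction
```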
6.3.2 Multidimensional Scaling
We then ran a simple metric (classical) multidimensional scaling to reduce this matrix, which contains the distance between all pairs of strings, down to two dimensions.
fit <- cmdscale(distmat, k = 2)
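Because edit distances are not guaranteed to embed perfectly in a Euclidean plane, it can be worth checking how much of the structure two dimensions retain. The sketch below does this on six made-up codes (codes_toy is invented, and adist() again stands in for the OSA metric) by asking cmdscale() to return its eigenvalues and goodness-of-fit values via eig = TRUE.

```r
# Hypothetical move sequences forming two loose groups, invented for illustration.
codes_toy <- c("SLMUTLJEK", "SLUTLJEK", "SLUTJEK",
               "ABCDEFG", "ABCDEF", "ABDEFG")
d_toy <- as.dist(adist(codes_toy))

# eig = TRUE returns a list instead of a bare coordinate matrix
fit_toy <- cmdscale(d_toy, k = 2, eig = TRUE)

dim(fit_toy$points)  # 6 rows (one per string), 2 columns (the MDS dimensions)
fit_toy$GOF          # two goodness-of-fit values; closer to 1 means less is lost
fit_toy$eig          # negative values indicate non-Euclidean structure
```

With eig = TRUE the coordinates live in fit_toy$points, so the later plotting code would read its x and y from that element rather than from the object itself.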
The years can then be plotted along these two dimensions to visualise clusters of years in the data. The results are a bit hard to see because of all the overlap, but there are three big clusters: early years on the left, later years in the middle, and middling years on the right. This right cluster is also more diffuse, reflecting the greater rate of change in the middling years.
x <- fit[, 1]
y <- fit[, 2]
plot(x, y, xlab = "Dimension 1", ylab = "Dimension 2", main = "MDS Plot of Patent Sequence Distances by Year",
type = "n")
text(x, y, labels = patents$YEAR, cex = 0.5, col = "seagreen3")
We can pull these clusters out a bit better if we run k-means on these years over the two dimensions, asking for three clusters, and we can then plot those over time.
Note that k-means can give a different result each time it runs, since the initial seeds are placed at random. This affects both the order of the clusters, which is arbitrary but matters to us because we want to plot them over time, and the actual membership, which is not arbitrary. In terms of membership, I’ve run it multiple times and about 80% of the time it gives back the clusters one would expect; the other 20% of the time it classifies the upper few years in the right cluster with the middle cluster, although that does lead to a simpler picture when plotting the clusters over time (less overlap). So when running this, make sure the final run is a good one.
clusts <- kmeans(fit, centers = 3, iter.max = 10000)
color = c()
color[clusts$cluster == 1] <- "seagreen3"
color[clusts$cluster == 2] <- "violetred3"
color[clusts$cluster == 3] <- "darkorange1"
plot(x, y, xlab = "Dimension 1", ylab = "Dimension 2", main = "MDS Plot of Patent Sequence Distances by Year",
type = "n")
text(x, y, labels = patents$YEAR, cex = 0.5, col = color)
# Box colours follow cluster order 1, 2, 3 so that they match the MDS plot above
boxplot(patents$YEAR ~ clusts$cluster, ylab = "Cluster", names = c(), xlab = "Year",
main = "Patent Distance Clusters", col = c("seagreen3", "violetred3",
"darkorange1"), horizontal = TRUE)
The boxplot shows the relationship between these three clusters over time more clearly: the one on the middle-left consists entirely of early patents (pre-1850), the one on the top-middle consists primarily of later patents (post-1900), and the one on the right consists primarily of patents between these two points (1850-1900), although there is some overlap between the latter two eras, especially between 1900-1940, indicating that there is competition between these two general types of patent move sequences around that time.
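The run-to-run instability of k-means noted above can be reduced in two ways, sketched here on made-up coordinates (coords_toy and years_toy are invented stand-ins for the MDS output and the patent years). Fixing the seed and using several random starts via nstart makes the membership repeatable, and relabelling the clusters by their median year makes the cluster order chronological regardless of the arbitrary numbering k-means returns. This is a suggestion, not what was run for the figures in this section.

```r
# Hypothetical data: three tight groups of ten "years" each, invented for illustration.
years_toy <- 1901:1930
centre_x <- rep(c(-5, 5, 0), each = 10)
centre_y <- rep(c(0, 0, 5), each = 10)
offset <- rep(seq(-0.2, 0.2, length.out = 10), times = 3)
coords_toy <- cbind(centre_x + offset, centre_y - offset)

set.seed(1)  # arbitrary seed, fixed so that reruns give the same answer
# nstart = 50 keeps the best of 50 random initialisations
km <- kmeans(coords_toy, centers = 3, nstart = 50, iter.max = 100)

# Relabel clusters 1..3 in order of median year
median_year <- sapply(split(years_toy, km$cluster), median)
new_label <- rank(median_year)  # named by the original cluster id
era <- as.integer(new_label[as.character(km$cluster)])

table(era, decade = rep(c("1901-10", "1911-20", "1921-30"), each = 10))
```

With labels ordered this way, a fixed colour vector can be indexed by era, so the scatter plot and the boxplot always agree on which colour belongs to which period.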
Finally, we can also plot these two dimensions individually against time, and also smooth them so that we can abstract away a bit from the year-to-year noisiness and focus in on the larger trends.
plot(patents$YEAR, x, type = "l", col = "seagreen3", main = "Edit Distance (Dimension 1)",
ylab = "MDS 1", xlab = "Year")
s_dim1 <- ma(as.ts(x), 10, centre = TRUE)
plot(patents$YEAR, s_dim1, type = "l", col = "seagreen3", main = "Edit Distance (Dimension 1, Smoothed)",
ylab = "MDS 1", xlab = "Year")
plot(patents$YEAR, y, type = "l", col = "violetred3", main = "Edit Distance (Dimension 2)",
ylab = "MDS 2", xlab = "Year")
s_dim2 <- ma(as.ts(y), 10, centre = TRUE)
plot(patents$YEAR, s_dim2, type = "l", col = "violetred3", main = "Edit Distance (Dimension 2, Smoothed)",
ylab = "MDS 2", xlab = "Year")
And we can combine them.
plot(patents$YEAR, s_dim1, type = "l", col = "seagreen3", ylim = c(-8, 8), main = "Edit Distance (Dimension 1 + 2, Smoothed)",
ylab = "MDS 1 + 2", xlab = "Year")
lines(patents$YEAR, s_dim2, col = "violetred3")
6.3.3 Summary
Overall, we therefore get four main eras: 1740-1850, 1850-1900, 1900-1940, and 1940-2011, with the third era being transitional.
…