A year or two ago, there was a workshop in the Boston area on digital archaeology, Mobilizing the Past for a Digital Future: The Potential of Digital Archaeology. The proceedings from that workshop are now available from the Digital Press at the University of North Dakota. I’ve already generated, and discussed, both a topic model and a word vector model of Mobilizing the Past (see Electric Archaeology, October 10 2016). Now I’m ready to generate a topic model across both volumes to see how this slippery beast ‘digital archaeology’ has changed since Archaeology 2.0 came out in 2011.
I split both volumes into 500-line chunks and then, re-using a script by Ben Marwick first developed to explore change over time in what archaeologists write about during the Day of Archaeology, I fit the topic model to see how the topics break down by volume.
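For anyone who wants to try the chunking step, here is a minimal sketch in base R. It assumes the text is already in memory as a character vector of lines; the function name `chunk_lines`, the toy data, and the placeholder metadata are all made up for illustration and are not the script used for this post.

```r
# Split a character vector of lines into consecutive bins of `bin_size` lines
# and collapse each bin into a single text string.
# `chunk_lines` and its arguments are hypothetical names for illustration.
chunk_lines <- function(lines, bin_size = 500) {
  bin <- ceiling(seq_along(lines) / bin_size)
  unname(vapply(split(lines, bin), paste, character(1), collapse = " "))
}

# Toy stand-in for one volume: 1,200 short "lines".
toy_lines <- paste("line", 1:1200)
chunks <- chunk_lines(toy_lines, bin_size = 500)

# One row per chunk, with metadata columns alongside the text.
# The author/chapter/year values are placeholders, not real data.
toy_csv <- data.frame(
  id = paste0("toy_", seq_along(chunks)),
  text = chunks,
  author = "placeholder author",
  chapter = "placeholder chapter",
  year = 2016,
  stringsAsFactors = FALSE
)
# write.csv(toy_csv, "chunks.csv", row.names = FALSE)
```

The last bin is simply shorter when the number of lines is not a multiple of 500, so bins near the end of a volume carry less text than the others.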
Again, if you check the code for this R Markdown document, you’ll see exactly how to do this on your own materials. I fed R a CSV where each row was a 500-line bin from the texts, with columns indicating the author, the chapter, and the year for that text chunk. This material had been made all lowercase beforehand (using the tr command at the terminal prompt). Then we use a standard stop-words file to remove the ‘if’, ‘and’, ‘but’, etc., all those wee words that (in this case) get in the way. The topic model is set to generate 30 topics, using the MALLET package. The import step looks like this:
documents <- data.frame(text = mobtext$text, id = make.unique(mobtext$id), class = mobtext$year, author = mobtext$author, stringsAsFactors=FALSE)
mallet.instances <- mallet.import(documents$id, documents$text, "en.txt", token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
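To get a feel for what the token.regexp and the stop-words file do to the text, here is a rough base R imitation on a toy sentence. It is only a sketch: `toy_text` and `toy_stopwords` are invented, and MALLET’s own tokenizer may differ in its details.

```r
# Illustrative only: mimic the preprocessing in base R on a toy sentence.
# `toy_text` and `toy_stopwords` are made-up stand-ins for the real inputs.
toy_text <- "The data, e.g., Open-Context records, and so on"
toy_stopwords <- c("the", "and", "if", "but")

# Same pattern as the token.regexp above: a letter, then one or more letters
# or punctuation marks, then a letter. Tokens are therefore at least three
# characters long and may keep internal punctuation.
token_pattern <- "\\p{L}[\\p{L}\\p{P}]+\\p{L}"

lowered <- tolower(toy_text)
tokens <- regmatches(lowered, gregexpr(token_pattern, lowered, perl = TRUE))[[1]]

# Drop any token that appears in the stop-word list.
kept <- tokens[!tokens %in% toy_stopwords]
print(kept)
```

On this toy sentence the sketch keeps data, e.g, open-context and records. Two-letter words fall out because the pattern needs at least three characters, and a token like "e.g" survives because the pattern allows punctuation in the middle of a token.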
Let’s take a look at what we find:
topics.labels
## [1] "semantic web cyberinfrastructure computing united"
## [2] "workflows development paleoway historic deliverables"
## [3] "archaeological digital new technologies systems"
## [4] "field project excavation used paper"
## [5] "recording fieldwork paperless tablets ipad"
## [6] "press journal university eds practice"
## [7] "media social content figure bonecommons"
## [8] "faims ark recording software development"
## [9] "tools slow efficiency practices practice"
## [10] "site using figure results work"
## [11] "local chav’n community world members"
## [12] "web open context content information"
## [13] "students database context research mobile"
## [14] "repository access open repositories information"
## [15] "virtual model recording modeling software"
## [16] "process time archaeologists example information"
## [17] "app html pkapp database mobile"
## [18] "web files preservation critical repository"
## [19] "digital papers kersel ethical life"
## [20] "project inka photogrammetry uav aerial"
## [21] "pladypos underwater caesarea vehicle harbor"
## [22] "iaks project web iadb virtual"
## [23] "pompeii ellis season fig paperless"
## [24] "university research project volume institute"
## [25] "digital caraher recording mobile e.g"
## [26] "data research use different access"
## [27] "open context kansa projects ethical"
## [28] "like human manifesto documentation platforms"
## [29] "objects museum knowledge catalog social"
## [30] "project records archaeotools information literature"
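Each label is just the handful of highest-weighted words in a topic, pasted together. Here is a small base R sketch of that idea on an invented topic-by-word matrix; `label_topics` is a hypothetical helper, and the mallet package offers mallet.top.words() for the real thing.

```r
# Toy topic-by-word weight matrix (2 topics, 6 words); values are invented.
toy_topic_words <- matrix(
  c(0.30, 0.25, 0.20, 0.15, 0.05, 0.05,
    0.05, 0.05, 0.10, 0.20, 0.25, 0.35),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("open", "context", "data", "field", "tablets", "ipad"))
)

# Label each topic with its `n_top` highest-weighted words, in order.
# `label_topics` is a hypothetical helper, not part of any package.
label_topics <- function(topic_words, n_top = 5) {
  apply(topic_words, 1, function(weights) {
    top <- order(weights, decreasing = TRUE)[seq_len(n_top)]
    paste(colnames(topic_words)[top], collapse = " ")
  })
}

toy_labels <- label_topics(toy_topic_words, n_top = 3)
print(toy_labels)
```

Because a label shows only the top few words, two topics with similar labels can still differ a good deal further down their word lists.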
Only one author writes in both volumes, Eric Kansa. Eric is well known for his advocacy of Open Access approaches in archaeology, and in particular, to data. Looking at the titles of his works in these volumes, the first is very much about building infrastructure and the second about the unintended consequences of techno-utopianism in archaeology. So we’ll get R to show us the topic breakdown in the chunks (remember, 500-line bins) that Eric wrote.
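The plot needs a long data frame with one row per chunk and topic. Here is one way such a frame might be assembled, as a sketch on invented numbers; the object names and the toy metadata are assumptions, not the objects from the actual analysis.

```r
# Toy document-by-topic proportions (3 chunks, 3 topics); values are invented.
toy_doc_topics <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.6, 0.3,
    0.2, 0.2, 0.6),
  nrow = 3, byrow = TRUE
)
toy_meta <- data.frame(
  id = c("kansa_2011_1", "other_2016_1", "kansa_2016_1"),
  author = c("kansa", "someone else", "kansa"),
  stringsAsFactors = FALSE
)
toy_labels <- c("topic a", "topic b", "topic c")

# Keep only the rows (chunks) written by the author of interest.
keep <- which(toy_meta$author == "kansa")
subset_props <- toy_doc_topics[keep, , drop = FALSE]

# Reshape to long form: one row per (document, topic) pair.
# as.vector() reads the matrix column by column, so documents cycle
# fastest and each topic label is repeated once per document.
toy_long <- data.frame(
  document = rep(toy_meta$id[keep], times = ncol(subset_props)),
  topic = rep(toy_labels, each = nrow(subset_props)),
  value = as.vector(subset_props),
  stringsAsFactors = FALSE
)
print(toy_long)
```

A frame shaped like `toy_long` has the topic, value and document columns that the ggplot call expects.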
library(ggplot2)
ggplot(topic.proportions.df, aes(topic, value, fill=document)) +
geom_bar(stat="identity") +
ylab("proportion") +
theme(axis.text.x = element_text(angle=90, hjust=1)) +
coord_flip() +
facet_wrap(~ document, ncol=5)
And happily, we see what we expected to see, which reassures us that the topics in the pieces we haven’t read are sensible. (Obviously, this isn’t a foolproof approach, but it’s good enough for this quick and dirty distant read.)
Dendrograms are handy for topic models because they give you a good sense of how the topics relate to one another. Keep in mind that we have both volumes in here, so these topics should differ a bit from those in the previous parts of this little series.
plot(hclust(dist(topic.words)), labels=topics.labels)
A Dendrogram of 30 Topics in Mobilizing the Past and Archaeology 2.0
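If the figure is hard to read, the same tree can be cut into groups and listed as text. Here is a sketch on an invented four-topic matrix; the labels and weights are made up, and with the real model you would pass the full topic-by-word matrix instead.

```r
# Toy topic-by-word matrix: 4 topics over 3 words; values are invented.
toy_topic_words <- matrix(
  c(0.80, 0.10, 0.10,
    0.70, 0.20, 0.10,
    0.10, 0.10, 0.80,
    0.10, 0.20, 0.70),
  nrow = 4, byrow = TRUE
)
toy_labels <- c("publishing a", "publishing b", "fieldwork a", "fieldwork b")

# Euclidean distances between topics, then complete-linkage clustering
# (the defaults of dist() and hclust()).
toy_tree <- hclust(dist(toy_topic_words))

# Cut the tree into two groups and list the topics that fall in each.
toy_groups <- cutree(toy_tree, k = 2)
print(split(toy_labels, toy_groups))
```

cutree() also accepts a height via its `h` argument, which is one way to pull out the branches on either side of a split at a given level of the dendrogram.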
I can’t quite get the hang of sizing these damned figures properly. Ah well. I will point out the interesting split at about the 0.10 level in that dendrogram between publishing (journals and data) and teaching/fieldwork. It’s interesting that in archaeology teaching so often does mean fieldwork. I’m in a history department these days, and I think it’s a model historians could be exploring (to see exactly that in action, see MSU’s CHI Initiative Fieldschool).
library(reshape2)
df3m <- melt(df3[,-4], id = 3)
ggplot(df3m, aes(fill = as.factor(topic), topic, value)) +
geom_bar(stat="identity") +
coord_flip() +
facet_wrap(~ variable)
Topics in 2011 v 2016
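The comparison above boils down to averaging each topic’s proportion over the chunks from each year. Here is a sketch of that aggregation on invented numbers; the column names are assumptions and this is not necessarily how df3 was built.

```r
# Toy per-chunk topic proportions with the year of the volume each chunk
# came from; all values are invented.
toy_props <- data.frame(
  year = c(2011, 2011, 2016, 2016),
  topic_1 = c(0.6, 0.4, 0.1, 0.3),
  topic_2 = c(0.4, 0.6, 0.9, 0.7)
)

# Mean proportion of each topic within each year.
toy_means <- aggregate(cbind(topic_1, topic_2) ~ year, data = toy_props, FUN = mean)

# Long form: one row per (topic, year) pair, ready for faceting by year.
toy_long <- data.frame(
  topic = rep(c("topic_1", "topic_2"), each = nrow(toy_means)),
  year = rep(toy_means$year, times = 2),
  value = c(toy_means$topic_1, toy_means$topic_2)
)
print(toy_long)
```

Averaging within each year keeps the comparison fair when one volume contributes more chunks than the other, which a simple sum would not.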
There are some pretty clear differences. I’ll leave it up to you to explore those differences for yourself! Clearly though, one main difference is that Mobilizing the Past has more case studies, more site-specific explorations of particular issues in digital archaeology (in keeping with the editors’ original wish to generate a kind of handbook, perhaps?), whereas Archaeology 2.0 is more about laying the groundwork for understanding what ‘digital archaeology’ was (at that time, in that (North American) place). I would’ve thought that the two volumes would not be so clearly distinct - they’re both about digital archaeology, both covering the same thing, the field is sorted, right? - but that’s not the case. The digital moves fast, but I think what we’re seeing here is that we are not reinventing the wheel. These two volumes do not compete so much as they complement each other.
And so we progress.