Introduction

A year or two ago, there was a workshop in the Boston area on digital archaeology, Mobilizing the Past for a Digital Future: The Potential of Digital Archaeology (see the conference site). The proceedings from that workshop are now available from the Digital Press at the University of North Dakota (see the press blurb). I thought it would be interesting to review that work by reading it distantly, and then to see how it compares with an earlier important work on digital archaeology, Archaeology 2.0 (available online; apparently, there’s an undiscovered ‘easter egg’ in that volume too). This essay is step one in that program: generating, and reflecting on, a topic model.

Generating A Topic Model

There are any number of ways to generate, or fit, a topic model to a collection of materials. Since I was allowed to have an advance copy of the volume, and it arrived as a pdf, I extracted the text from the volume into a single text file. Since the two volumes are different lengths, and I want things to be roughly comparable, I broke each one into 500-line chunks and ingested them (I used the command split -l 500 completevolume.txt at the terminal to achieve this). The actual code for the topic model (and especially for comparing topic distribution within and across the volumes) was repurposed from Ben Marwick’s ‘Day of Archaeology’ code (‘A Distant Reading of a Day of Archaeology’, https://github.com/benmarwick/dayofarchaeology).
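For anyone who would rather stay inside R than use the terminal, the same chunking can be sketched roughly as follows. This is a minimal illustration, not the code actually used: chunk_lines is a hypothetical helper, and the toy lines stand in for the result of reading completevolume.txt with readLines().

```r
# Split a character vector of lines into consecutive chunks of at most `size` lines.
# `chunk_lines` is a hypothetical helper, not part of the original workflow.
chunk_lines <- function(lines, size = 500) {
  if (length(lines) == 0) return(list())
  # Group id per line: 1 for lines 1..size, 2 for the next block, and so on.
  group <- ceiling(seq_along(lines) / size)
  unname(split(lines, group))
}

# Toy stand-in for readLines("completevolume.txt")
toy_lines <- paste("line", 1:1200)
chunks <- chunk_lines(toy_lines, size = 500)

# Collapse each chunk into a single string, i.e. one "document" for the model.
documents <- vapply(chunks, paste, character(1), collapse = "\n")

# Number of lines in each chunk; the last chunk holds the remainder.
lengths(chunks)
```

Fixed-size chunks mean a long chapter contributes more documents than a short one, but each document carries a similar amount of text, which is the point of the exercise.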

I then twiddled various knobs and dials and eventually decided that 30 topics was about right. Here’s what I found:

topics.labels

##  [1] "database used specialists troy finds"           
##  [2] "database students context research analysis"    
##  [3] "peru heritage photogrammetry center cultural"   
##  [4] "inka uav architectural tambo aerial"            
##  [5] "survey site figure results area"                
##  [6] "data field project recording projects"          
##  [7] "faims software module server testing"           
##  [8] "ethical christen intellectual cultural property"
##  [9] "digital archaeological new work tools"          
## [10] "ark commercial existing web minories"           
## [11] "development mobile local technical case"        
## [12] "commercial measurement apple’s computer kap"    
## [13] "services branding commercial jstor projects"    
## [14] "pladypos underwater caesarea vehicle harbor"    
## [15] "future workshop institute humanities ethical"   
## [16] "database paperless figure svp user"             
## [17] "app database pkapp html custom"                 
## [18] "project local chavín members community"         
## [19] "technologies need colleagues drones want"       
## [20] "model virtual modeling software public"         
## [21] "open context kansa research public"             
## [22] "pompeii ellis poehler quadriporticus pdf"       
## [23] "like volume manifesto documentation human"      
## [24] "university research project ph.d professor"     
## [25] "workflows aap devices paper-based on-site"      
## [26] "site development workflows paleoway historic"   
## [27] "university journal press eds world"             
## [28] "efficiency practices trench practice slow"      
## [29] "fieldwork tablets paper tablet paperless"       
## [30] "used ellis ipad wallrodt ipads"
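Each label above appears to be the five highest-weighted words of a topic, joined by spaces. A minimal base-R sketch of that idea, assuming a topic-by-word weight matrix, looks like this. The toy matrix and the top_terms_label helper are made up for illustration and are not taken from the original code.

```r
# Toy topic-word weight matrix: rows are topics, columns are vocabulary terms.
topic_words_toy <- matrix(
  c(0.40, 0.30, 0.20, 0.05, 0.05,
    0.05, 0.10, 0.15, 0.30, 0.40),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("database", "finds", "context", "tablet", "paperless"))
)

# Label a topic with its n highest-weighted terms, joined by spaces.
# `top_terms_label` is a hypothetical helper.
top_terms_label <- function(weights, vocab, n = 3) {
  paste(vocab[order(weights, decreasing = TRUE)[seq_len(n)]], collapse = " ")
}

# Apply the helper to every row (topic) of the matrix.
toy_labels <- apply(topic_words_toy, 1, top_terms_label,
                    vocab = colnames(topic_words_toy), n = 3)
toy_labels
```

The labels are only a convenience for reading the plots; the topic itself is the whole distribution over words, so two topics that share a top word (‘database’, say) can still be quite different further down the list.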

Check that these topics make sense

We know that Bill Caraher’s piece in Mobilizing the Past makes the case for a ‘slow’ archaeology (Bill has written a lot about slow archaeology; start here). Let’s see what topics are present in his chapter:

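library(ggplot2)
# topic.proportions.df is assumed to be in long form:
# one row per document-topic pair, with columns topic, value, and document.
# As written, this draws one panel per document found in the data frame.
ggplot(topic.proportions.df, aes(topic, value, fill=document)) +
  geom_bar(stat="identity") +                            # bar height = topic proportion
  ylab("proportion") +
  theme(axis.text.x = element_text(angle=90, hjust=1)) +
  coord_flip() +                                         # topics run down the side
  facet_wrap(~ document, ncol=5)                         # panels laid out five per row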

That makes a lot of sense, knowing what we know already about Bill’s work. What about Eric’s piece? (I feel safe in drawing attention to Bill’s and Eric’s pieces because they have already been shared online, in various ways.)

library(ggplot2)
ggplot(topic.proportions.df, aes(topic, value, fill=document)) +
  geom_bar(stat="identity") +
  ylab("proportion") +
  theme(axis.text.x = element_text(angle=90, hjust=1)) +  
  coord_flip() +
  facet_wrap(~ document, ncol=5)

Finally, let’s look at the introduction to the volume:

library(ggplot2)
ggplot(topic.proportions.df, aes(topic, value, fill=document)) +
  geom_bar(stat="identity") +
  ylab("proportion") +
  theme(axis.text.x = element_text(angle=90, hjust=1)) +  
  coord_flip() +
  facet_wrap(~ document, ncol=5)

The preface (which can be read online via the press link above) discusses how the organizers had hoped that some sort of handbook or how-to for digital archaeology might emerge from their meeting; the distribution of topics here reflects that concern for method. So, this wee check reassures us that the topics we found make sense, given what we might naturally expect knowing the context.

Topics across the entire volume

Let’s see how topics play out across the entire volume. First, let’s generate a dendrogram to see how the topics relate to one another. This can help us work out which topics are likely subtopics of one another, or otherwise close to one another in the semantic space of this volume. (My apologies if this dendrogram spreads down the page. I can’t quite get the hang of this layout package, Tufte.)

plot(hclust(dist(topic.words)), labels=topics.labels)

A Dendrogram of 30 Topics in Mobilizing the Past
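To make the dendrogram step a little less magical, here is a toy version of the same hclust(dist(...)) call on four made-up topics. The matrix and labels are invented for illustration; only hclust, dist, and cutree (all in base R’s stats package) are real.

```r
# Toy topic-word matrix: 4 topics over 3 terms.
# Topics 1 and 2 weight the same term heavily, as do topics 3 and 4.
toy_topic_words <- rbind(
  c(0.70, 0.20, 0.10),
  c(0.65, 0.25, 0.10),
  c(0.10, 0.20, 0.70),
  c(0.05, 0.25, 0.70)
)
toy_labels <- c("database finds", "database context",
                "tablet paperless", "ipad tablet")
rownames(toy_topic_words) <- toy_labels

# Euclidean distance between topic rows, then hierarchical clustering
# (hclust uses complete linkage unless told otherwise).
toy_tree <- hclust(dist(toy_topic_words))

# Cut the tree into two groups to read off which topics sit together.
toy_groups <- cutree(toy_tree, k = 2)
toy_groups

# plot(toy_tree) would draw the corresponding dendrogram.
```

Topics that put their weight on the same words end up a short distance apart and so join low in the tree; that is all ‘close in the semantic space’ means here.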

We can also visualize the average proportions of each topic by chapter, as well as by section.

ggplot(df3m, aes(fill = as.factor(topic), topic, value)) +
  geom_bar(stat="identity") +
  coord_flip()  +
  facet_wrap(~ variable)

Topics over Chapters

It’s quite clear here that most authors stick to one or two main topics in their chapters - which makes sense; a chapter shouldn’t be all over the map, one intuits, if it is to be of interest and utility. What is a bit more interesting is the way the book chapters, in aggregate, cover different aspects of things digital. I take this as a healthy sign that the field hasn’t congealed yet, that there is a wide possibility-space of ways to be digital and archaeological. The first section is thematically more varied. Section two seems more concerned with the mobile web and public reception (which the titles of the papers would not lead you to expect; is my topic model not fine-grained enough?). Section three: all workflows. Section four: ethics. The book seems to wind up with a consideration of the institutional contexts.
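The chapter-level averages behind a plot like the one above can be computed roughly as follows. This is a base-R sketch under stated assumptions: the chunk-to-chapter assignment, the column names, and the numbers are all hypothetical, and the original df3m may well have been built differently.

```r
# Toy chunk-level topic proportions: each row is one chunk and sums to 1.
# The chapter assignment and column names are hypothetical.
chunk_props <- data.frame(
  chapter = c("ch1", "ch1", "ch2"),
  topic_1 = c(0.8, 0.6, 0.1),
  topic_2 = c(0.2, 0.4, 0.9)
)

# Mean proportion of each topic within each chapter.
chapter_means <- aggregate(cbind(topic_1, topic_2) ~ chapter,
                           data = chunk_props, FUN = mean)

# Reshape to long form: one row per chapter-topic pair,
# which is the shape a faceted bar chart expects.
stacked <- stack(chapter_means[, c("topic_1", "topic_2")])
chapter_long <- data.frame(
  chapter = rep(chapter_means$chapter, times = 2),
  topic   = as.character(stacked$ind),
  value   = stacked$values
)
chapter_long
```

Averaging by section works the same way, with a section column in place of the chapter column.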

So what have we learned?

I can hear you now: Couldn’t you, Shawn, have said the same thing by perusing the table of contents? Well, yes, no doubt, to a degree. But here we can see how topics group, and knowing how they average across the volume gives us a real sense of where the field - as these practitioners see it - is and where it’s heading.

And we still haven’t read a damned thing. Keep that in mind: this is just the result of the computer counting words. Speaking of which, in a subsequent post, I’ll look at word use in particular. Topic models give us a top-down look at things; word vectors give us the bottom-up (for more on word vectors, see Ben Schmidt). This will give us a sense of whether the ‘brogrammer’ culture (ugh) has crept into archaeology, how digital work in archaeology is gendered (see ‘Some of us are brave’, JDH), and so on.

And of course, as archaeologists, it’s change over time that really piques our interest. I will eventually compare this volume with the earlier volume edited by Kansa, Kansa, and Watrall, with topic models and word vectors. But I think at this point, you can spot interesting patterns in what’s going on, and I invite you to use Hypothesis to annotate this little public experiment.

A Topic Model of Mobilizing the Past

or, Distantly Reading Digital Archaeology

Shawn Graham

2016-10-07
