A Topic Model of Mobilizing the Past

or, Distantly Reading Digital Archaeology

Shawn Graham

2016-10-07

Introduction

A year or two ago, there was a workshop in the Boston area on digital archaeology.[1] The proceedings from that workshop are now available from the Digital Press at the University of North Dakota.[2] I thought it would be interesting to review that work by reading it distantly, and then to see how it compares with an earlier important work on digital archaeology, Archaeology 2.0.[3] This essay is step one in that program: generating, and reflecting on, a topic model.

[1] Mobilizing the Past for a Digital Future: the Potential of Digital Archaeology conference site.
[2] See the press blurb here.
[3] Available here. Apparently, there’s an undiscovered ‘easter egg’ in that volume too.

Generating A Topic Model

There are any number of ways to generate, or fit, a topic model to a collection of materials. Since I was allowed to have an advance copy of the volume, and it arrived as a pdf, I extracted the text from the volume into a single text file. Since the two volumes are different lengths, and I want things to be roughly comparable, I broke each one into 500-line chunks and ingested those as the documents.[4] The actual code for the topic model (and especially for comparing topic distribution within and across the volumes) was repurposed from Ben Marwick’s ‘Day of Archaeology’ code.[5]

[4] I used the command split -l 500 completevolume.txt at the terminal to achieve this.
[5] ‘A Distant Reading of a Day of Archaeology’, https://github.com/benmarwick/dayofarchaeology.
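For anyone who would rather stay inside R than drop to the terminal, the chunking step can be approximated in base R. This is only a sketch of the same idea as split -l 500, not the code actually used for this post: the function name chunk_lines and the toy input are my own inventions, and in practice the lines would come from something like readLines("completevolume.txt").

```r
# Split a character vector of text lines into chunks of at most
# `chunk_size` lines, roughly mirroring `split -l 500` at the terminal.
# `chunk_lines` is a hypothetical helper, not part of any package.
chunk_lines <- function(lines, chunk_size = 500) {
  if (length(lines) == 0) return(character(0))
  # Chunk index for every line: 1 for the first chunk_size lines, 2 for the next, ...
  chunk_id <- ceiling(seq_along(lines) / chunk_size)
  chunks <- split(lines, chunk_id)
  # Collapse each chunk into one string, so each chunk becomes one "document"
  vapply(chunks, paste, character(1), collapse = "\n")
}

# Toy stand-in for the extracted volume: 1,234 made-up lines
toy_lines <- paste("line", seq_len(1234))
documents <- chunk_lines(toy_lines, chunk_size = 500)
length(documents)
```

The last chunk is simply shorter than the others, which is also how split behaves, so the final "document" of each volume will carry a little less text than the rest.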

I then twiddled various knobs and dials and eventually decided that 30 topics was about right. Here’s what I found:

topics.labels
##  [1] "database used specialists troy finds"           
##  [2] "database students context research analysis"    
##  [3] "peru heritage photogrammetry center cultural"   
##  [4] "inka uav architectural tambo aerial"            
##  [5] "survey site figure results area"                
##  [6] "data field project recording projects"          
##  [7] "faims software module server testing"           
##  [8] "ethical christen intellectual cultural property"
##  [9] "digital archaeological new work tools"          
## [10] "ark commercial existing web minories"           
## [11] "development mobile local technical case"        
## [12] "commercial measurement appleõs computer kap"    
## [13] "services branding commercial jstor projects"    
## [14] "pladypos underwater caesarea vehicle harbor"    
## [15] "future workshop institute humanities ethical"   
## [16] "database paperless figure svp user"             
## [17] "app database pkapp html custom"                 
## [18] "project local chav’n members community"         
## [19] "technologies need colleagues drones want"       
## [20] "model virtual modeling software public"         
## [21] "open context kansa research public"             
## [22] "pompeii ellis poehler quadriporticus pdf"       
## [23] "like volume manifesto documentation human"      
## [24] "university research project ph.d professor"     
## [25] "workflows aap devices paper-based on-site"      
## [26] "site development workflows paleoway historic"   
## [27] "university journal press eds world"             
## [28] "efficiency practices trench practice slow"      
## [29] "fieldwork tablets paper tablet paperless"       
## [30] "used ellis ipad wallrodt ipads"
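Each label above appears to be nothing more than the five highest-weighted words for that topic, pasted together. A minimal base R sketch of that idea follows; the weight matrix, the words in it, and the function name label_topics are all invented for illustration, and this is not the routine that produced the labels above.

```r
# Hypothetical topic-by-word weight matrix (rows = topics, columns = words).
# The numbers are made up purely to illustrate the labelling step.
topic_words <- matrix(
  c(0.30, 0.25, 0.20, 0.15, 0.05, 0.05,
    0.05, 0.05, 0.10, 0.20, 0.25, 0.35),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("tablet", "paperless", "fieldwork",
                          "trench", "slow", "practice"))
)

# Label each topic with its n highest-weighted words
label_topics <- function(weights, n = 5) {
  apply(weights, 1, function(w) {
    top <- names(sort(w, decreasing = TRUE))[seq_len(n)]
    paste(top, collapse = " ")
  })
}

labels <- label_topics(topic_words, n = 3)
labels
```

Reading the labels this way is a reminder that they are a convenience: a topic is a distribution over the whole vocabulary, and the label only shows its peak.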

Check that these topics make sense

We know that Bill Caraher’s piece in Mobilizing the Past makes the case for a ‘slow’ archaeology.[6] Let’s see what topics are present in his chapter:

[6] Bill has written a lot about slow archaeology; start here.

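library(ggplot2)
# topic.proportions.df is assumed to be in long format:
# one row per document/topic pair, with the proportion in `value`
ggplot(topic.proportions.df, aes(topic, value, fill=document)) +
  geom_bar(stat="identity") +   # bar height is the stored proportion
  ylab("proportion") +
  theme(axis.text.x = element_text(angle=90, hjust=1)) +
  coord_flip() +                # topics run down the side, bars run across
  facet_wrap(~ document, ncol=5)  # one panel per document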
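To read off a single chunk's profile as numbers rather than squinting at one panel among many, the long-format data frame can be filtered and sorted first. A sketch, assuming only the three columns the plotting call uses (document, topic, value); the toy data, the chunk names, and the helper top_topics are invented for illustration.

```r
# Toy long-format data frame with the same columns the plot uses.
# Document names, topic labels, and values are all made up.
toy_props <- data.frame(
  document = rep(c("chunk_01", "chunk_02"), each = 3),
  topic = rep(c("slow practice", "database students", "tablets paperless"),
              times = 2),
  value = c(0.70, 0.10, 0.20, 0.15, 0.60, 0.25),
  stringsAsFactors = FALSE
)

# Keep one document and return its n most prominent topics
top_topics <- function(df, doc, n = 2) {
  one <- df[df$document == doc, ]
  one <- one[order(one$value, decreasing = TRUE), ]
  head(one, n)
}

top_topics(toy_props, "chunk_01")
```

The same filtered data frame could be handed to the ggplot call in place of the full one to draw a single chapter on its own.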

That makes a lot of sense, knowing what we already know about Bill’s work. What about Eric’s piece? (I feel safe in drawing attention to Bill’s and Eric’s pieces because they have already been shared online in various ways.)

library(ggplot2)
ggplot(topic.proportions.df, aes(topic, value, fill=document)) +
  geom_bar(stat="identity") +
  ylab("proportion") +
  theme(axis.text.x = element_text(angle=90, hjust=1)) +  
  coord_flip() +
  facet_wrap(~ document, ncol=5)

Finally, let’s look at the introduction to the volume:

library(ggplot2)
ggplot(topic.proportions.df, aes(topic, value, fill=document)) +
  geom_bar(stat="identity") +
  ylab("proportion") +
  theme(axis.text.x = element_text(angle=90, hjust=1)) +  
  coord_flip() +
  facet_wrap(~ document, ncol=5)