Word Vectors & Mobilizing the Past

Distantly Reading Digital Archaeology, Part II

Shawn Graham

2016-10-08

Top down, bottom up

In the first exploration of the text of Mobilizing the Past [1: see the press blurb, download, here], I generated a quick ‘n’ dirty topic model exploring some of the larger trends in the discourses within that volume. In a future exploration, I will compare it with an earlier volume on digital archaeology from 2011 [2: Kansa, Kansa, & Watrall, Archaeology 2.0; it’s available here].

If topic models give you a top-down perspective on what is happening in your corpus, word vectors reverse the view to let us see what is happening at the level of individual words. One thing that I find interesting to do is to define binary pairs, and see what words stretch along the continuum between them. Going further, as Ben Schmidt showed us [3: for more on word vectors, see this post of his], we can explore the ways pairs of binaries intersect. So, let’s start exploring [4: you can get the package here].

Firstly, I have the text of the volume, all in lowercase, which I feed into the train_word2vec function:
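The training call itself is not shown in this write-up. A minimal sketch of what it might look like with the wordVectors package follows. The file names and every parameter value are assumptions for illustration, not the settings actually used; the only name taken from this post is blogmodel.

```r
# Hypothetical file names; substitute your own.
txt_file <- "mobilizing_the_past.txt"   # lowercased text of the volume (assumed name)
bin_file <- "mobilizing_the_past.bin"   # where the trained vectors are written (assumed name)

# Only attempt training when the package and the text are both available.
can_train <- requireNamespace("wordVectors", quietly = TRUE) && file.exists(txt_file)

if (can_train) {
  library(wordVectors)
  library(magrittr)  # supplies the %>% pipe used throughout the post
  # All values below are illustrative guesses, not the post's settings.
  blogmodel <- train_word2vec(txt_file, output_file = bin_file,
                              vectors = 100, window = 12, threads = 2,
                              min_count = 5, force = TRUE)
}
```

With a corpus as small as a single volume, min_count and vectors are probably the settings most worth experimenting with, since rare words get poorly estimated vectors.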

Digital v. Analog

That done, what words are nearest, in the model, to ‘digital’?

nearest_to(blogmodel,blogmodel[["digital"]])
##       digital     paperless        review        divide       huggett 
## -6.661338e-16  5.116213e-01  5.290055e-01  6.030911e-01  6.058402e-01 
##     fetishism         syria         class           why introspective 
##  6.122098e-01  6.133435e-01  6.164904e-01  6.202798e-01  6.284196e-01
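
The numbers under each word read as cosine distances, that is, one minus cosine similarity, so smaller means closer and the query word itself sits at zero, give or take floating-point error. A toy sketch in base R, using made-up three-dimensional vectors rather than anything from the model, shows the calculation:

```r
# Toy vectors, invented for illustration only.
toy <- rbind(
  digital   = c(1.0, 0.2, 0.0),
  paperless = c(0.9, 0.4, 0.1),
  analog    = c(-0.8, 0.5, 0.3)
)

# Cosine distance = 1 - (a . b) / (|a| |b|)
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Distance from every row to the 'digital' row, nearest first.
d <- sort(apply(toy, 1, cosine_distance, b = toy["digital", ]))
print(round(d, 3))
```

This is why ‘digital’ shows up in its own neighbour list with a value like -6.66e-16: it is zero distance from itself, apart from rounding.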

It’s quite clear that in Mobilizing the Past, ‘paperless’ archaeology is very much hand-in-glove with the idea of going digital. So, let’s create a vector from the idea of ‘paperless, digital’ archaeology:

digital_words = blogmodel %>% nearest_to(blogmodel[[c("digital","paperless")]],100) %>% names
sample(digital_words,50)
##  [1] "fully"                       "entangled"                  
##  [3] "strengths"                   "why"                        
##  [5] "beta"                        "divide"                     
##  [7] "missing"                     "workflows"                  
##  [9] "thing"                       "paperless"                  
## [11] "streamlining"                "seeing"                     
## [13] "think"                       "hybrid"                     
## [15] "initiative"                  "allison"                    
## [17] "worry"                       "dicus"                      
## [19] "implementing"                "benefit"                    
## [21] "lessons"                     "implications"               
## [23] "legacy"                      "semidigital"                
## [25] "transition"                  "suspect"                    
## [27] "kit"                         "introduction"               
## [29] "improvements"                "start"                      
## [31] "revolution"                  "established"                
## [33] "changed"                     "notion"                     
## [35] "consider"                    "dirt"                       
## [37] "agree"                       "six"                        
## [39] "dialogue"                    "exciting"                   
## [41] "vital"                       "fluid"                      
## [43] "nakassis"                    "demonstrates"               
## [45] "benefits"                    "manifesto"                  
## [47] "andor"                       "resulted"                   
## [49] "httppaperlessarchaeologycom" "thoughts"

I particularly like the idea that digital archaeology is ‘reflective’, ‘entangled’, and ‘exciting’. I certainly feel this way, and am glad to see it emerge in this volume (which, mark you, I still haven’t read). Let’s take a look at how these words relate to one another, using a dendrogram:

g1 = blogmodel[rownames(blogmodel) %in% digital_words [1:50],]

group_distances1 = cosineDist(g1,g1) %>% as.dist
plot(as.dendrogram(hclust(group_distances1)),cex=1, main="Cluster dendrogram of the fifty words closest to a 'digital' vector\nin Mobilizing the Past")
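
For readers without the package to hand, the same clustering step can be approximated in base R. This is a sketch on invented toy vectors; cosine_dist_matrix is a hypothetical helper standing in for cosineDist, not the package's own implementation.

```r
# Toy word vectors, invented for illustration.
toy <- rbind(
  workflows = c(0.9, 0.1, 0.2),
  paperless = c(0.8, 0.2, 0.1),
  legacy    = c(0.1, 0.9, 0.3),
  dirt      = c(0.2, 0.8, 0.4)
)

# Hypothetical stand-in for cosineDist(): pairwise 1 - cosine similarity.
cosine_dist_matrix <- function(m) {
  unit <- m / sqrt(rowSums(m^2))   # scale each row to unit length
  1 - unit %*% t(unit)
}

toy_dist <- as.dist(cosine_dist_matrix(toy))
toy_tree <- hclust(toy_dist)       # complete linkage is the default
groups   <- cutree(toy_tree, k = 2)

# plot(as.dendrogram(toy_tree)) would draw the tree, as in the call above.
```

Words whose vectors point in similar directions merge low in the tree; cutting the tree at k groups is one quick way to read off the clusters without the picture.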

We don’t know, yet, how this ‘digital’ vector plays out in value-space: is ‘digital’ good? bad? Is it gendered? But before we ask those questions, let’s look at an antonym for ‘digital’: ‘analog’.

nearest_to(blogmodel,blogmodel[["analog"]])
##        analog         valid       amounts          wait       replace 
## -4.440892e-16  2.437904e-01  2.529104e-01  2.542137e-01  2.654687e-01 
##       whether    questioned         begun           yet          move 
##  2.732477e-01  2.844869e-01  2.845951e-01  2.919575e-01  3.057712e-01

A great deal more ambivalence. My initial impression is that these words express not opposition to the digital, but rather unease about its role in supplanting tried-and-true analog methods, and about whether that is a wise idea. Let’s explore it a bit more:

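analog_words = blogmodel %>% nearest_to(blogmodel[[c("analog")]],100) %>% names
# Note: this call samples digital_words, not analog_words, so the list printed
# below shows neighbours of the 'digital' vector; sample(analog_words,50) was
# probably intended.
sample(digital_words,50)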
##  [1] "dialogue"      "allison"       "successes"     "you"          
##  [5] "introduction"  "nakassis"      "manifesto"     "improvements" 
##  [9] "review"        "workflows"     "benefits"      "resulted"     
## [13] "entangled"     "attempts"      "https"         "fully"        
## [17] "expand"        "agree"         "acknowledge"   "evolving"     
## [21] "bridging"      "kit"           "doing"         "lessons"      
## [25] "semidigital"   "introspective" "exciting"      "vital"        
## [29] "changed"       "hybrid"        "worry"         "paperbased"   
## [33] "six"           "missing"       "established"   "legacy"       
## [37] "benefit"       "transition"    "llobera"       "start"        
## [41] "streamlining"  "blog"          "slow"          "borndigital"  
## [45] "luke"          "save"          "really"        "fieldwork"    
## [49] "practitioners" "dicus"
g2 = blogmodel[rownames(blogmodel) %in% analog_words [1:50],]

group_distances2 = cosineDist(g2,g2) %>% as.dist
plot(as.dendrogram(hclust(group_distances2)),cex=1, main="Cluster dendrogram of the fifty words closest to an 'analog' vector\nin Mobilizing the Past")

A definite ambivalence there. Let’s see how the words run when we define a vector from digital through to analog.

ggplot(word_scores %>% filter(abs(mode_score)>.425)) + geom_bar(aes(y=mode_score,x=reorder(word,mode_score),fill=mode_score<0),stat="identity") + coord_flip()+scale_fill_discrete("Indicative of mode",labels=c("analog","digital")) + labs(title="The words showing the strongest skew along the analog-digital binary")

I think, perhaps, what this is suggesting is a fear of what we might be losing in the march of the digital, or maybe, things that we need to be aware of as we explore this area. It is interesting that the words that are most ‘digital’ in this vector seem to be in connection with publishing (or at least, that’s how I read ‘quarterly’ and ‘saa’ and ‘literary’).
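
The code that builds word_scores and mode_score is not shown above. One plausible construction, as I understand the approach in the Ben Schmidt post mentioned earlier, scores each word by its cosine similarity to the difference between the two poles. The sketch below does this in base R on invented vectors. The direction (analog minus digital, so positive scores lean analog) is inferred from the plot's labels; the seed words actually used are unknown, ‘notebook’ and ‘site’ are placeholder words, and cosine_to is a hypothetical helper.

```r
# Toy vectors, invented for illustration; real ones come from the trained model.
toy <- rbind(
  digital   = c(1.0, 0.1),
  analog    = c(-1.0, 0.2),
  paperless = c(0.8, 0.3),
  notebook  = c(-0.7, 0.4),
  site      = c(0.0, 1.0)
)

# Assumed direction: analog minus digital, so positive scores lean 'analog'.
mode_vector <- toy["analog", ] - toy["digital", ]

# Cosine similarity of every row of m with the vector v.
cosine_to <- function(m, v) {
  as.vector(m %*% v) / (sqrt(rowSums(m^2)) * sqrt(sum(v^2)))
}

word_scores <- data.frame(word = rownames(toy),
                          mode_score = cosine_to(toy, mode_vector),
                          stringsAsFactors = FALSE)
print(word_scores[order(word_scores$mode_score), ])
```

Words close to zero, like ‘site’ in this toy, are simply indifferent to the binary; the filter on abs(mode_score) in the plotting call is what discards them.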

Value Judgements?

Archaeologists are human. What things are ‘good’ in this model, and what things are ‘bad’? Turns out, the word ‘bad’ is not in the model at all. The closest term would seem to be ‘problematic’:

ggplot(word_scores %>% filter(abs(value_score)>.45)) + geom_bar(aes(y=value_score,x=reorder(word,value_score),fill=value_score<0),stat="identity") + coord_flip()+scale_fill_discrete("Indicative of value",labels=c("good","problematic")) + labs(title="The words showing the strongest skew along the value binary")

That one doesn’t tell us much, other than to say perhaps these archaeologists are an optimistic bunch.

Gender?

Next, let’s do the same again, by defining a ‘gender’ vector using pronouns.

ggplot(word_scores %>% filter(abs(gender_score)>.35)) + geom_bar(aes(y=gender_score,x=reorder(word,gender_score),fill=gender_score<0),stat="identity") + coord_flip()+scale_fill_discrete("Indicative of gender",labels=c("he","she")) + labs(title="The words showing the strongest skew along the gender binary")

Is ‘Digital’ Gendered? What is the value of ‘Digital’?

Finally, let us take these vectors and combine them in interesting ways. In the digital to analog vector, which words are gendered male, and which are gendered female? This involves crossing our ‘mode_vector’ against the ‘gender_vector’.

word_scores %>% mutate( modeedness=ifelse(mode_score>0,"analog","digital"),gender=ifelse(gender_score>0,"male","female")) %>% group_by(gender,modeedness) %>% filter(rank(-(abs(mode_score*gender_score)))<=36) %>% mutate(eval=-1+rank(abs(gender_score)/abs(mode_score))) %>% ggplot() + geom_text(aes(x=eval %/% 12,y=eval%%12,label=word,fontface=ifelse(modeedness=="analog",2,3),color=gender),hjust=0) + facet_grid(gender~modeedness) + theme_minimal() + scale_x_continuous("",lim=c(0,3)) + scale_y_continuous("") + labs(title="The top words gendered female (red) and male (blue)\n used with 'digital' (italics) and 'analog'(bold) words") + theme(legend.position="none")

Science is male. It’s large-scale, if it’s digital. It’s important, it’s an improvement, though it has limitations. If it’s digital, it’s open and reproducible. When it’s female and digital, it seems to be a footnote (or at least, that’s how I’m interpreting those fragments). When it’s female and analog, it’s supervisors who communicate… [5: This perhaps ought to be unpacked a bit more, which is what you do when you read distantly. You spot patterns, then dive into the text forewarned and forearmed, and come back again to rerun your distant reading.] I’m not entirely happy with what’s going on in this diagram, which makes me think that the vectors have to be defined more carefully.
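
To make the crossing step less opaque: each word gets a score on both axes, the sign of each score assigns it to a quadrant, and within each quadrant the words with the largest product of absolute scores are kept. A base-R sketch with placeholder words and invented scores follows (the code above keeps 36 words per quadrant; this keeps one to stay readable).

```r
# Placeholder words and invented scores; real ones come from word_scores.
scores <- data.frame(
  word         = c("word_a", "word_b", "word_c", "word_d", "word_e", "word_f"),
  mode_score   = c(-0.6, -0.5,  0.4,  0.7, -0.1,  0.2),
  gender_score = c( 0.5, -0.4, -0.6,  0.3,  0.1, -0.1),
  stringsAsFactors = FALSE
)

# The sign of each score picks the quadrant (same convention as the ifelse calls above).
scores$modeedness <- ifelse(scores$mode_score > 0, "analog", "digital")
scores$gender     <- ifelse(scores$gender_score > 0, "male", "female")

# Strength of a word on both axes at once.
scores$strength <- abs(scores$mode_score * scores$gender_score)

# Keep the strongest word(s) in each quadrant.
top_n <- 1
by_quadrant <- split(scores, list(scores$gender, scores$modeedness), drop = TRUE)
top_words <- lapply(by_quadrant, function(q) {
  q$word[order(-q$strength)][seq_len(min(top_n, nrow(q)))]
})
print(top_words)
```

Ranking by the product rewards words that are strongly marked on both binaries, which is also why a word that is extreme on one axis and neutral on the other never appears in the diagram.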

Let’s do the same again, this time comparing the ‘mode_vector’ against the ‘value_vector’

word_scores %>% mutate( modeedness=ifelse(mode_score>0,"analog","digital"),value=ifelse(value_score>0,"positive","negative")) %>% group_by(value,modeedness) %>% filter(rank(-(abs(mode_score*value_score)))<=36) %>% mutate(eval=-1+rank(abs(value_score)/abs(mode_score))) %>% ggplot() + geom_text(aes(x=eval %/% 12,y=eval%%12,label=word,fontface=ifelse(modeedness=="analog",2,3),color=value),hjust=0) + facet_grid(value~modeedness) + theme_minimal() + scale_x_continuous("",lim=c(0,3)) + scale_y_continuous("") + labs(title="The top negative (red) and positive (blue) words \n used with 'digital' (italics) and 'analog'(bold) words") + theme(legend.position="none")

In this one, I think the fact that the ‘value_vector’ runs from good to problematic is probably making things a bit squiffy. But there certainly seems to be a sense that the punk qualities of digital archaeology, the introspection, the craft of it, are broadly positive. The negative qualities of the digital, if I can unpick one thread, seem to perhaps connect with the teaching (or lack thereof) in the classroom [6: again, vectors constructed more carefully than what I have done this evening would be clearer and make more sense].

To wind up this quick distant read

In this quick glance from a distance at Mobilizing the Past, we see a digital archaeology that in some respects is a continuation of the analog archaeology in which it is entwined. There’s a clear sense that digital is transforming the practice of archaeology, but also, that this is freighted with anxiety. Some of the greatest work in digital archaeology is explicitly feminist archaeology (thinking Tringham, Morgan, etc), and I don’t get that sense from this volume read at a distance.

You will draw different conclusions, of course. When you get your hands on the book, save all the text from the pdf into a txt file, make it all lowercase, and then grab my code and feed it the text yourself. Look for other interesting words or binaries. Correct the flaws in what I’ve done.
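
A minimal sketch of that preparation step in base R, assuming the PDF text has already been saved as plain text. The file paths are placeholders (temporary files, so the snippet runs on its own), the two sample lines are invented, and stripping punctuation is my assumption, suggested by tokens such as ‘andor’ and ‘borndigital’ in the output above.

```r
# Placeholder input: in practice this would be the text saved from the PDF.
raw_file <- tempfile(fileext = ".txt")
writeLines(c("Mobilizing the Past: Paperless and/or Digital?",
             "Born-digital data, 2016."), raw_file)

clean_file <- tempfile(fileext = ".txt")

txt <- readLines(raw_file, warn = FALSE)
txt <- tolower(txt)                    # make it all lowercase
txt <- gsub("[^a-z0-9 ]", "", txt)     # assumed step: drop punctuation (also drops non-ASCII letters)
txt <- gsub(" +", " ", trimws(txt))    # collapse runs of spaces
writeLines(txt, clean_file)

print(readLines(clean_file))
```

The cleaned file is what would then be handed to the training function.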

To come: a comparison of topics with Kansa, Kansa, & Watrall’s 2011 Archaeology 2.0; also, a similar exploration of word vectors therein. How far have we come over the last five years, if these two volumes are used as bookends?