Open Context has an excellent API for grabbing the archaeological data it holds. For instance, it holds 2618 site diaries - here’s one of them. Append .json to the end of a record’s URL, and poof, lots of data. Here’s the json version of that same diary. So, I wanted all of those diaries - this URL (click & then note where the .json lives; delete the .json to see the regular html) has ’em all.
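Since we’ll be working in R anyway, you can grab a single record programmatically too (a sketch using jsonlite; the URL is a placeholder - substitute the address of any Open Context record):
# fetch a single diary as json, straight from the API
library(jsonlite)
diary <- fromJSON("https://opencontext.org/[path-to-a-diary].json")
str(diary, max.level = 1)  # peek at the top-level structure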
I copied and pasted that list of urls into a .txt file, and fed it to wget at the command line:
wget -i urlstograb.txt -O output.txt
which gave me a file some 27 mb in size (wget’s -O flag concatenates everything it downloads into that one file). Now, it’s possible to feed json into R, but the problem was that this single file was not formatted as a single json file, which is what R expects. Rather, it was 2600-ish individual json files, all concatenated together. This makes things awkward. My solution - and there are no doubt more elegant ones - was to use regex to mark every line that contained a diary and delete everything else; then I used regex again to strip out the html. Nearly every entry has ‘summary journal’ near its beginning, so I used that as a marker, inserting a pipe character | as a delimiter and then opening the file in Excel. Excel asks, hey, what character are you using to delimit this? I say, the pipe! and Excel turns it into a comma separated file with two columns, id and text. Some 1200 diaries came from (an)other excavation(s) where there was clearly a style guide for what went into them; I removed these from the present analysis.
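(For what it’s worth, the same cleanup could be scripted in R - a rough sketch, assuming each diary’s json landed on its own line of output.txt, and with the output filename just illustrative:)
# keep only the lines containing diaries, strip the html tags,
# and write out a two-column csv of id and text
lines <- readLines("output.txt")
diaries <- lines[grepl("summary journal", lines, ignore.case = TRUE)]
diaries <- gsub("<[^>]+>", " ", diaries)  # crude html removal
df <- data.frame(id = seq_along(diaries), text = diaries, stringsAsFactors = FALSE)
write.csv(df, "diaries.csv", row.names = FALSE)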
The remaining 1400 diaries, which I topic model in R, I am also exploring with Textplot. Textplot eventually produces a network graph; modularity analysis of that graph finds 13 very strong groups. Accordingly, I topic model for 13 topics.
I want to see how the topics arrived at via topic modeling compare with the topics arrived at via Textplot.
setwd("/Users/shawngraham/Desktop/data mining and tools/open-context/")
Then we import our data:
#Rio makes this easy. I'm assuming you know how to install packages.
library("rio")
##importing opencontext site diaries
documents <- import("diaries-in-date-order-R.csv")
Then we set up Mallet in R. We need as much memory as we’ve got; note that the Java heap size has to be set before rJava loads, or it won’t take effect:
options(java.parameters = "-Xmx5120m")
library(rJava)
## from http://cran.r-project.org/web/packages/mallet/mallet.pdf
library(mallet)
Now we pass the diaries to Mallet. There’s a bit of iteration here: I first ran the entire topic modeling code to get the list of words and their frequencies (which you’ll see further below); the most frequent words are frequent by orders of magnitude, so I inserted them into my stopword list so that they don’t overwhelm the analysis.
mallet.instances <- mallet.import(as.character(documents$id),
                                  as.character(documents$text),
                                  "/Users/shawngraham/Desktop/data mining and tools/TextAnalysisWithR/data/stoplist2.csv",
                                  FALSE, # preserve.case
                                  token.regexp="\\p{L}[\\p{L}\\p{P}]+\\p{L}") # tokens of 3+ characters, internal punctuation allowed (hence words like 'turkey/kenan' below)
#set the number of desired topics
num.topics <- 13
topic.model <- MalletLDA(num.topics)
## Load our documents. We could also pass in the filename of a
## saved instance list file that we build from the command-line tools.
topic.model$loadDocuments(mallet.instances)
## Get the vocabulary, and some statistics about word frequencies.
## These may be useful in further curating the stopword list.
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)
head(word.freqs)
## words term.freq doc.freq
## 1 july 100 94
## 2 working 188 156
## 3 measured 55 44
## 4 side 970 405
## 5 hill 60 43
## 6 prickly 2 2
# write.csv(word.freqs, "oc-word-freqs.csv" ) <- study this file, use it to modify your stoplist file
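A quick way to surface stoplist candidates from those frequencies (a sketch; the cut-off of twenty is arbitrary):
# the twenty most frequent words - prime stoplist candidates
head(word.freqs[order(-word.freqs$term.freq), ], 20)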
So let’s see what we get:
## Optimize hyperparameters every 20 iterations,
## after 50 burn-in iterations.
topic.model$setAlphaOptimization(20, 50)
## Now train a model. Note that hyperparameter optimization is on, by default.
## We can specify the number of iterations. Here we'll use a large-ish round number.
topic.model$train(1000)
## NEW: run through a few iterations where we pick the best topic for each token,
## rather than sampling from the posterior distribution.
topic.model$maximize(10)
## Get the probability of topics in documents and the probability of words in topics.
## By default, these functions return raw word counts. Here we want probabilities,
## so we normalize, and add "smoothing" so that nothing has exactly 0 probability.
doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)
# save(doc.topics, file = "ocdoctopics.RData") <- we need this for doing self-organized maps. We'll worry about these later.
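If you want a quick sanity check on the normalization:
# with normalized=T, each document's topic proportions should sum to ~1
summary(rowSums(doc.topics))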
What are the top words in each topic?
mallet.top.words(topic.model, topic.words[1,])
## words weights
## 1 half 0.01709561
## 2 digging 0.01668474
## 3 pot 0.01331559
## 4 dug 0.01282254
## 5 pottery 0.01249384
## 6 day 0.01241167
## 7 dirt 0.01200080
## 8 section 0.01191862
## 9 began 0.01175427
## 10 find 0.01117905
mallet.top.words(topic.model, topic.words[2,])
## words weights
## 1 oven 0.11048688
## 2 mudbrick 0.03683250
## 3 feature 0.01494587
## 4 slag 0.01193155
## 5 large 0.01153838
## 6 underneath 0.01140732
## 7 july 0.01127626
## 8 work 0.01114520
## 9 digging 0.01088309
## 10 time 0.01009674
mallet.top.words(topic.model, topic.words[3,])
## words weights
## 1 excavated 0.01400361
## 2 section 0.01327059
## 3 excavation 0.01217106
## 4 sample 0.01217106
## 5 northern 0.01202446
## 6 structure 0.01143805
## 7 corner 0.01099824
## 8 season 0.01099824
## 9 material 0.01092493
## 10 ubaid 0.01085163
mallet.top.words(topic.model, topic.words[4,])
## words weights
## 1 brick 0.03488463
## 2 layer 0.03209051
## 3 mud 0.02977662
## 4 ash 0.02685153
## 5 north 0.02615300
## 6 south 0.02554178
## 7 bricks 0.02331522
## 8 west 0.02270400
## 9 east 0.02139426
## 10 line 0.01615528
mallet.top.words(topic.model, topic.words[5,])
## words weights
## 1 walls 0.02582379
## 2 room 0.01776622
## 3 began 0.01628625
## 4 collapsed 0.01529961
## 5 collapse 0.01464185
## 6 d/trench 0.01398409
## 7 house 0.01381965
## 8 furnace 0.01381965
## 9 discovered 0.01217524
## 10 june 0.01184636
mallet.top.words(topic.model, topic.words[6,])
## words weights
## 1 meters 0.027028264
## 2 work 0.016873764
## 3 meter 0.016529543
## 4 f/trench 0.013087340
## 5 east 0.011194128
## 6 trenches 0.010333577
## 7 tepe 0.010161466
## 8 began 0.009989356
## 9 located 0.009817246
## 10 time 0.009817246
mallet.top.words(topic.model, topic.words[7,])
## words weights
## 1 burial 0.08580118
## 2 bones 0.03547480
## 3 burials 0.02316090
## 4 human 0.01914550
## 5 bone 0.01727165
## 6 a/trench 0.01727165
## 7 skull 0.01365778
## 8 skeleton 0.01272086
## 9 turkey/kenan 0.01057931
## 10 tepe/area 0.01031162
mallet.top.words(topic.model, topic.words[8,])
## words weights
## 1 sounding 0.03687594
## 2 turkey/kenan 0.02269521
## 3 tepe/area 0.02269521
## 4 july 0.01881417
## 5 step 0.01836636
## 6 bottom 0.01791854
## 7 topsoil 0.01762000
## 8 mudbrick 0.01702292
## 9 excavated 0.01478386
## 10 started 0.01448531
mallet.top.words(topic.model, topic.words[9,])
## words weights
## 1 turkey/kenan 0.05480939
## 2 tepe/area 0.05480939
## 3 today 0.04440979
## 4 day 0.02681425
## 5 daily 0.02464561
## 6 july 0.02306842
## 7 tomorrow 0.01616821
## 8 floor 0.01385171
## 9 taking 0.01271810
## 10 work 0.01261953
mallet.top.words(topic.model, topic.words[10,])
## words weights
## 1 rocks 0.031244287
## 2 side 0.021873372
## 3 west 0.021092462
## 4 east 0.013283365
## 5 flat 0.012502456
## 6 compacted 0.012307228
## 7 balk 0.010940636
## 8 pottery 0.009964499
## 9 line 0.009574045
## 10 large 0.008988362
mallet.top.words(topic.model, topic.words[11,])
## words weights
## 1 rocks 0.03842486
## 2 part 0.03314275
## 3 side 0.02930779
## 4 south 0.02851186
## 5 rock 0.02778828
## 6 north 0.02706470
## 7 west 0.02597934
## 8 east 0.02402568
## 9 corner 0.02373625
## 10 pottery 0.01932243
mallet.top.words(topic.model, topic.words[12,])
## words weights
## 1 loci 0.02812991
## 2 small 0.02165249
## 3 removed 0.01972000
## 4 stones 0.01778751
## 5 large 0.01768015
## 6 season 0.01700020
## 7 excavated 0.01657076
## 8 stone 0.01549715
## 9 surfaces 0.01524664
## 10 part 0.01370781
mallet.top.words(topic.model, topic.words[13,])
## words weights
## 1 mudbrick 0.05053302
## 2 section 0.02497572
## 3 south 0.01910048
## 4 east 0.01821919
## 5 plaster 0.01689726
## 6 north 0.01528157
## 7 cut 0.01410652
## 8 side 0.01322524
## 9 west 0.01307836
## 10 top 0.01175643
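(Incidentally, rather than calling mallet.top.words thirteen times by hand as above, a short loop does the same job:)
# print the top ten words for each topic in turn
for (i in 1:num.topics) {
  print(mallet.top.words(topic.model, topic.words[i,]))
}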
Let’s see what the topic labels look like.
## from my other script; above was Mimno's example script
topic.docs <- t(doc.topics)                    # transpose: rows are now topics, columns are documents
topic.docs <- topic.docs / rowSums(topic.docs) # normalize each topic's distribution over the documents
write.csv(topic.docs, "oc-topics-docs.csv")
## Get a vector containing short names for the topics
topics.labels <- rep("", num.topics)
for (topic in 1:num.topics) {
  topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,],
                                                 num.top.words=5)$words, collapse=" ")
}
# have a look at keywords for each topic
topics.labels
## [1] "half digging pot dug pottery"
## [2] "oven mudbrick feature slag large"
## [3] "excavated section excavation sample northern"
## [4] "brick layer mud ash north"
## [5] "walls room began collapsed collapse"
## [6] "meters work meter f/trench east"
## [7] "burial bones burials human bone"
## [8] "sounding turkey/kenan tepe/area july step"
## [9] "turkey/kenan tepe/area today day daily"
## [10] "rocks side west east flat"
## [11] "rocks part side south rock"
## [12] "loci small removed stones large"
## [13] "mudbrick section south east plaster"
And if we turn them into word clouds, we get a sense of the relative importance of words that appear in different topics.
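(The chunk that drew the clouds isn’t reproduced here; judging from the warnings it threw, it was something along these lines - a sketch, with the number of words and the scale being guesses:)
# draw one cloud per topic, word size proportional to weight in that topic
library(wordcloud)
for (i in 1:num.topics) {
  topic.top.words <- mallet.top.words(topic.model, topic.words[i,], 25)
  wordcloud(topic.top.words$words, topic.top.words$weights,
            c(4, 0.8), rot.per = 0, random.order = FALSE)
}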
## (thirteen word clouds are plotted here, one per topic; wordcloud warned
## that 'southern' and 'sounding' could not be fit on the page and were
## not plotted)
##generate clusters of diaries that are similar in their distribution of topics
topic_docs <- data.frame(topic.docs)
names(topic_docs) <- documents$id
library(cluster)
topic_df_dist <- as.matrix(daisy(t(topic_docs), metric = "euclidean", stand = TRUE))
# Change a row's values to zero if they're greater than the row minimum plus
# the row standard deviation; this keeps only closely related documents and
# avoids a dense spaghetti diagram that's difficult to interpret
# (hat-tip: http://stackoverflow.com/a/16047196/1036500)
topic_df_dist[ sweep(topic_df_dist, 1, (apply(topic_df_dist,1,min) + apply(topic_df_dist,1,sd) )) > 0 ] <- 0
# Use kmeans to identify groups of similar diaries
km <- kmeans(topic_df_dist, num.topics)
# get names for each cluster
allnames <- vector("list", length = num.topics)
for(i in 1:num.topics){
  allnames[[i]] <- names(km$cluster[km$cluster == i])
}
allnames
## [[1]]
## [1] "142" "146" "172" "197" "211" "222" "262" "326" "332" "343"
## [11] "345" "354" "376" "379" "410" "413" "494" "512" "515" "521"
## [21] "528" "535" "565" "577" "606" "614" "622" "624" "645" "653"
## [31] "656" "664" "697" "736" "752" "753" "765" "775" "778" "815"
## [41] "831" "850" "866" "878" "898" "909" "911" "964" "971" "988"
## [51] "990" "1000" "1002" "1003" "1004" "1094" "1096" "1098" "1101" "1102"
## [61] "1109" "1110" "1136" "1147" "1159" "1165" "1171" "1174" "1175" "1188"
## [71] "1190" "1211" "1218" "1224" "1226" "1235" "1246" "1257" "1271" "1272"
## [81] "1280" "1285" "1286" "1288" "1290" "1294" "1307" "1324" "1334" "1336"
## [91] "1344" "1347" "1352" "1355" "1360" "1361" "1364" "1373" "1374" "1377"
## [101] "1393" "1395" "1416"
##
## [[2]]
## [1] "37" "86" "104" "132" "151" "153" "164" "169" "174" "175"
## [11] "195" "199" "215" "248" "253" "263" "282" "296" "299" "301"
## [21] "312" "324" "335" "342" "353" "362" "365" "370" "380" "396"
## [31] "406" "414" "417" "423" "427" "441" "443" "444" "453" "457"
## [41] "465" "472" "478" "493" "498" "522" "530" "541" "568" "574"
## [51] "578" "588" "589" "593" "594" "599" "607" "612" "626" "642"
## [61] "655" "661" "679" "688" "694" "712" "735" "741" "744" "747"
## [71] "755" "768" "770" "788" "793" "803" "806" "824" "830" "835"
## [81] "839" "857" "876" "900" "919" "935" "979" "986" "999" "1006"
## [91] "1015" "1019" "1021" "1031" "1033" "1037" "1039" "1040" "1043" "1049"
## [101] "1057" "1062" "1065" "1067" "1068" "1073" "1077" "1078" "1083" "1090"
## [111] "1100" "1104" "1108" "1126" "1127" "1134" "1141" "1146" "1157" "1158"
## [121] "1161" "1164" "1169" "1184" "1189" "1198" "1201" "1205" "1206" "1208"
## [131] "1217" "1223" "1227" "1228" "1229" "1239" "1241" "1243" "1245" "1252"
## [141] "1256" "1261" "1264" "1265" "1266" "1292" "1293" "1302" "1303" "1304"
## [151] "1326" "1343" "1351" "1365" "1367" "1370" "1380" "1424" "1446"
##
## [[3]]
## [1] "307" "310" "356" "430" "431" "450" "482" "518" "544" "613"
## [11] "623" "628" "675" "743" "822" "840" "903" "913" "914" "1095"
## [21] "1129" "1153" "1166" "1179" "1183" "1193" "1199" "1204"
##
## [[4]]
## [1] "189" "191" "196" "209" "212" "213" "217" "219" "220" "227"
## [11] "228" "245" "252" "255" "267" "268" "280" "284" "288" "294"
## [21] "297" "298" "309" "313" "314" "322" "327" "330" "341" "347"
## [31] "361" "364" "372" "384" "412" "436" "440" "452" "454" "455"
## [41] "459" "460" "461" "468" "480" "495" "500" "517" "526" "539"
## [51] "546" "555" "558" "666" "678" "693" "708" "715" "729" "758"
## [61] "780" "786" "814" "841" "853" "867" "870" "894" "902" "907"
## [71] "929" "931" "939" "947" "1142" "1150" "1269"
##
## [[5]]
## [1] "4" "139" "225" "232" "233" "240" "241" "247" "257" "258"
## [11] "264" "265" "266" "285" "289" "300" "302" "311" "323" "328"
## [21] "334" "340" "357" "360" "374" "382" "385" "393" "397" "408"
## [31] "409" "418" "420" "421" "437" "439" "442" "456" "458" "470"
## [41] "471" "481" "485" "490" "527" "548" "560" "561" "562" "566"
## [51] "597" "598" "639" "652" "670" "671" "676" "685" "710" "711"
## [61] "724" "727" "738" "742" "761" "762" "764" "769" "790" "794"
## [71] "802" "804" "817" "819" "820" "833" "836" "846" "847" "877"
## [81] "882" "886" "895" "908" "918" "920" "928" "933" "936" "943"
## [91] "948" "951" "953" "956" "1059" "1151" "1222" "1240" "1244" "1306"
##
## [[6]]
## [1] "31" "149" "162" "243" "256" "279" "398" "492" "529" "531"
## [11] "683" "725" "756" "797" "809" "856" "869" "885" "1119" "1220"
## [21] "1238" "1267" "1282" "1295" "1350"
##
## [[7]]
## [1] "3" "5" "7" "13" "16" "20" "26" "27" "29" "30"
## [11] "32" "38" "44" "45" "46" "48" "51" "56" "58" "59"
## [21] "63" "64" "66" "72" "73" "75" "79" "80" "82" "84"
## [31] "87" "93" "96" "100" "101" "105" "108" "109" "112" "114"
## [41] "116" "117" "118" "119" "120" "122" "123" "126" "127" "129"
## [51] "154" "168" "170" "178" "204" "238" "246" "251" "259" "273"
## [61] "290" "317" "320" "339" "348" "349" "368" "371" "373" "401"
## [71] "415" "416" "434" "447" "449" "469" "483" "487" "497" "508"
## [81] "536" "554" "701" "732" "798" "881" "916" "925" "926" "955"
## [91] "1058" "1388"
##
## [[8]]
## [1] "629" "633" "641" "681" "689" "740" "746" "771" "808" "823"
## [11] "858" "890" "944" "967" "974" "1116" "1120" "1123" "1155" "1191"
## [21] "1196" "1202" "1207" "1210" "1225" "1259" "1268" "1277" "1287" "1296"
## [31] "1305" "1311" "1314" "1319" "1325" "1329" "1339" "1341" "1342" "1349"
## [41] "1354" "1358" "1359" "1362" "1371" "1389" "1390" "1396" "1398" "1401"
## [51] "1404" "1408" "1411" "1412" "1413" "1414" "1417" "1419" "1420" "1422"
## [61] "1423" "1425" "1426" "1428" "1433" "1434" "1438" "1441" "1442" "1443"
## [71] "1452" "1453" "1455" "1458" "1460"
##
## [[9]]
## [1] "8" "33" "68" "78" "107" "130" "136" "141" "143" "144"
## [11] "145" "150" "152" "157" "159" "163" "166" "171" "173" "179"
## [21] "181" "182" "188" "190" "202" "226" "260" "270" "276" "306"
## [31] "319" "321" "344" "346" "358" "381" "404" "422" "451" "467"
## [41] "477" "501" "511" "524" "540" "571" "592" "602" "608" "615"
## [51] "616" "617" "621" "631" "638" "649" "650" "657" "677" "682"
## [61] "690" "692" "699" "722" "739" "757" "759" "760" "791" "792"
## [71] "799" "805" "828" "852" "865" "884" "891" "906" "921" "952"
## [81] "959" "994" "1036" "1097" "1125" "1152" "1173" "1186" "1253" "1323"
## [91] "1429"
##
## [[10]]
## [1] "183" "559" "567" "573" "580" "585" "603" "604" "620" "635"
## [11] "654" "658" "668" "674" "704" "719" "733" "750" "776" "777"
## [21] "787" "796" "812" "825" "837" "887" "912" "915" "977" "978"
## [31] "987" "992" "993" "995" "996" "998" "1001" "1005" "1007" "1008"
## [41] "1009" "1010" "1011" "1014" "1023" "1026" "1029" "1034" "1035" "1038"
## [51] "1041" "1046" "1047" "1050" "1054" "1066" "1069" "1072" "1075" "1080"
## [61] "1081" "1084" "1086" "1087" "1091" "1106" "1113" "1122" "1130" "1131"
## [71] "1138" "1139" "1140" "1144" "1154" "1160" "1172" "1176" "1185" "1195"
## [81] "1213" "1214" "1249" "1260" "1270" "1276" "1291" "1299" "1317" "1353"
## [91] "1366" "1369" "1375" "1376" "1378" "1379" "1381" "1384" "1385" "1386"
##
## [[11]]
## [1] "121" "140" "147" "158" "160" "176" "192" "201" "221" "283"
## [11] "287" "331" "363" "366" "367" "389" "432" "433" "466" "479"
## [21] "484" "486" "496" "510" "513" "525" "542" "550" "556" "576"
## [31] "582" "583" "619" "627" "630" "637" "646" "651" "663" "709"
## [41] "723" "728" "737" "754" "785" "800" "810" "813" "818" "826"
## [51] "860" "872" "875" "982" "983" "984" "1022" "1051" "1085" "1099"
## [61] "1103" "1111" "1112" "1115" "1118" "1121" "1124" "1128" "1132" "1135"
## [71] "1143" "1145" "1149" "1162" "1167" "1177" "1181" "1203" "1209" "1212"
## [81] "1215" "1231" "1236" "1237" "1247" "1250" "1251" "1262" "1263" "1274"
## [91] "1281" "1300" "1301" "1313" "1322" "1331" "1338" "1340" "1348" "1356"
## [101] "1372" "1391" "1427" "1431"
##
## [[12]]
## [1] "2" "6" "9" "10" "11" "12" "14" "15" "17" "18"
## [11] "19" "21" "22" "23" "24" "25" "28" "34" "35" "36"
## [21] "39" "40" "41" "42" "43" "47" "49" "50" "52" "53"
## [31] "54" "55" "57" "60" "61" "62" "65" "67" "69" "70"
## [41] "71" "74" "76" "77" "81" "83" "85" "88" "89" "90"
## [51] "91" "92" "94" "95" "97" "98" "99" "102" "103" "106"
## [61] "110" "111" "113" "115" "124" "125" "128" "131" "133" "134"
## [71] "135" "137" "138" "148" "155" "156" "161" "165" "167" "177"
## [81] "180" "184" "185" "186" "187" "193" "194" "198" "200" "203"
## [91] "205" "206" "207" "208" "210" "214" "216" "218" "223" "224"
## [101] "229" "230" "231" "234" "235" "236" "237" "239" "242" "244"
## [111] "249" "250" "254" "261" "269" "271" "272" "274" "275" "277"
## [121] "278" "281" "286" "291" "292" "293" "295" "303" "304" "305"
## [131] "308" "315" "316" "318" "325" "329" "333" "336" "337" "338"
## [141] "350" "351" "352" "355" "359" "369" "375" "378" "383" "386"
## [151] "387" "388" "390" "391" "392" "394" "395" "399" "400" "402"
## [161] "403" "405" "407" "411" "419" "424" "425" "426" "428" "429"
## [171] "435" "438" "445" "446" "448" "462" "464" "473" "475" "476"
## [181] "488" "489" "491" "499" "502" "503" "504" "505" "506" "509"
## [191] "514" "516" "519" "520" "523" "532" "533" "534" "537" "543"
## [201] "545" "547" "549" "551" "552" "553" "557" "563" "564" "569"
## [211] "570" "572" "575" "579" "581" "584" "586" "587" "590" "591"
## [221] "595" "596" "600" "601" "605" "609" "610" "611" "618" "625"
## [231] "632" "634" "636" "643" "644" "648" "659" "660" "665" "667"
## [241] "669" "672" "673" "684" "686" "687" "691" "695" "696" "698"
## [251] "700" "703" "705" "706" "713" "714" "716" "717" "718" "720"
## [261] "721" "726" "730" "734" "745" "748" "749" "751" "763" "766"
## [271] "767" "772" "773" "774" "779" "781" "782" "783" "789" "795"
## [281] "801" "807" "811" "816" "821" "827" "829" "832" "834" "838"
## [291] "843" "844" "845" "848" "849" "851" "854" "855" "859" "861"
## [301] "862" "864" "868" "871" "874" "879" "880" "883" "888" "892"
## [311] "893" "896" "897" "899" "901" "904" "905" "910" "917" "922"
## [321] "923" "924" "927" "930" "932" "934" "937" "938" "940" "941"
## [331] "942" "945" "946" "949" "950" "954" "957" "958" "960" "961"
## [341] "962" "963" "965" "966" "968" "969" "970" "972" "973" "975"
## [351] "976" "980" "981" "985" "989" "991" "997" "1012" "1013" "1016"
## [361] "1017" "1018" "1020" "1024" "1025" "1027" "1028" "1030" "1032" "1042"
## [371] "1044" "1045" "1048" "1052" "1053" "1055" "1056" "1060" "1061" "1063"
## [381] "1064" "1070" "1071" "1074" "1076" "1079" "1082" "1088" "1089" "1092"
## [391] "1093" "1105" "1107" "1117" "1137" "1148" "1156" "1163" "1168" "1170"
## [401] "1178" "1180" "1182" "1187" "1194" "1197" "1216" "1221" "1230" "1232"
## [411] "1233" "1242" "1254" "1255" "1258" "1273" "1275" "1278" "1279" "1283"
## [421] "1284" "1297" "1298" "1308" "1309" "1312" "1315" "1316" "1318" "1320"
## [431] "1321" "1327" "1328" "1330" "1332" "1333" "1335" "1337" "1346" "1357"
## [441] "1363" "1382" "1383" "1387" "1392" "1394" "1397" "1399" "1400" "1402"
## [451] "1403" "1405" "1406" "1407" "1409" "1410" "1415" "1418" "1421" "1430"
## [461] "1432" "1435" "1436" "1437" "1439" "1440" "1444" "1445" "1447" "1448"
## [471] "1449" "1450" "1451" "1454" "1456" "1457" "1459"
##
## [[13]]
## [1] "377" "463" "474" "507" "538" "640" "647" "662" "680" "702"
## [11] "707" "731" "784" "842" "863" "873" "889" "1114" "1133" "1192"
## [21] "1200" "1219" "1234" "1248" "1289" "1310" "1345" "1368"
library(igraph)
g <- as.undirected(graph.adjacency(topic_df_dist))
layout1 <- layout.fruchterman.reingold(g, niter=500)
plot(g, layout=layout1, edge.curved = TRUE, vertex.size = 1, vertex.color= "grey", edge.arrow.size = 0, vertex.label.dist=0.5, vertex.label = NA)
write.graph(g, file="oc2.graphml", format="graphml")
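A quick check that the export captured everything (vcount and ecount simply count the graph's vertices and edges):
# how many diaries (vertices) and connections (edges) are in the graph?
vcount(g)
ecount(g)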
For the Textplot output, please see my repo here. There will also be an interactive version of the network above in that repository (along with a key for retrieving the original diaries).