Topic Modeling Archaeological Diaries

This is a continuation of my exploration of site diaries housed at Open Context.

The 1400 diaries then that I topic model in R I am also exploring with Textplot. Texplot eventually produces a network graph; modularity of that graph finds 13 very strong groups. Accordingly, I first topic model’d for 13 topics.

I want to see how topics via topic models compare versus topics via Textplot. Upshot: the resulting model was not very useful. So I’m going back and refitting a model with more meat to it.

setwd("/Users/shawngraham/Desktop/data mining and tools/open-context/")

Then we import our data:

#Rio makes this easy. I'm assuming you know how to install packages.

library("rio")

##importing opencontext site diaries
documents <- import("diaries-in-date-order-R.csv")

Then we set up Mallet in R. We need as much memory as we’ve got, so:

options(java.parameters = "-Xmx5120m")
library(rJava)
## from http://cran.r-project.org/web/packages/mallet/mallet.pdf
library(mallet)

Now we pass the diaries to Mallet. There’s a bit of iteration here; I ran the entire topic modeling code until I got a list of the words and their frequencies (which you’ll see further below); the most frequent words are frequent by orders of magnitude, so I insert them into my stopword list so that they don’t overwhelm the analysis.

mallet.instances <- mallet.import(as.character(documents$id), as.character(documents$text), "/Users/shawngraham/Desktop/data mining and tools/TextAnalysisWithR/data/stoplist2.csv", FALSE, token.regexp="\\p{L}[\\p{L}\\p{P}]+\\p{L}")

#set the number of desired topics
num.topics <- 40
topic.model <- MalletLDA(num.topics)

## Load our documents. We could also pass in the filename of a
## saved instance list file that we build from the command-line tools.
topic.model$loadDocuments(mallet.instances)
## Get the vocabulary, and some statistics about word frequencies.
## These may be useful in further curating the stopword list.
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)
head(word.freqs)

##      words term.freq doc.freq
## 1     july       133      115
## 2  working       188      156
## 3 measured        55       44
## 4     side       967      404
## 5     hill        60       43
## 6  prickly         2        2

# write.csv(word.freqs, "oc-word-freqs.csv" ) <- study this file, use it to modify your stoplist file

So let’s see what we get:

## Optimize hyperparameters every 20 iterations,
## after 50 burn-in iterations.
topic.model$setAlphaOptimization(20, 50)
## Now train a model. Note that hyperparameter optimization is on, by default.
## We can specify the number of iterations. Here we'll use a large-ish round number.
topic.model$train(1000)
## NEW: run through a few iterations where we pick the best topic for each token,
## rather than sampling from the posterior distribution.
topic.model$maximize(10)
## Get the probability of topics in documents and the probability of words in topics.
## By default, these functions return raw word counts. Here we want probabilities,
## so we normalize, and add "smoothing" so that nothing has exactly 0 probability.
doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)
# save(doc.topics, file = "ocdoctopics.RData") <-we need this for doing self-organized maps. We'll worry about these later.

Let’s see what the topic labels look like.

###from my other script; above was mimno's example script
topic.docs <- t(doc.topics)
topic.docs <- topic.docs / rowSums(topic.docs)
write.csv(topic.docs, "oc-topics-docs.csv" ) ## "C:\\Malletopic-docs.csv"
## Get a vector containing short names for the topics
topics.labels <- rep("", num.topics)
for (topic in 1:num.topics) topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=15)$words, collapse=" ")
# have a look at keywords for each topic
topics.labels

##  [1] "today daily day work july loci tomorrow continue plan end dirt cleaning june taking continued"                             
##  [2] "season excavated excavation loci year end time surfaces layers pits cut levels point final areas"                          
##  [3] "began foundation debris dirt clay large finding potsherds floor uncovered pot spent continued lot finds"                   
##  [4] "topsoil step top corner july opened meter meters slope located opening tepe dimensions loam loose"                         
##  [5] "rocks side number july part started tomorrow located day find mud-brick elevations process taking floor"                   
##  [6] "corner ash brown excavated light small clay mud compacted painter sounding july coarse opened fine"                        
##  [7] "stones oven part cobbles cobble surfaces dug find large stone north south east digging feature"                            
##  [8] "work mud brick loci north excavation debris southern july white daily western surfaces organic slump"                      
##  [9] "mudbrick plaster ash cut peg base line top step section edge underneath debris half red"                                   
## [10] "mud brick bricks material side collapse layer walls june room top floor collapsed thin burned"                             
## [11] "mudbrick walls plan layer section loci pits opened decided august excavate continue began ash stones"                      
## [12] "oven layer ash removed bricks ashy line mudbricks orange eppihimer part mudbrick inside construction platform"             
## [13] "half pot dug eastern rocks lines section balk large mud august appeared southern brick sherds"                             
## [14] "pot small sherds large pottery stone layer debris pieces bone rim top continued ground began"                              
## [15] "loci elevation debris structure surfaces cms burial opened small taking cobbled continued interior lot collected"          
## [16] "meters began small located baldi collected tepe meter time lower point areas good materials work"                          
## [17] "rocks side west compacted flat east line large pottery balk decided layer mud-brick loose corner"                          
## [18] "mudbrick feature grain ubaid portion structure large context appears finally contexts possibly material good collapse"     
## [19] "floor loci large began clear matrix point decided balk visible pick picks appeared designated big"                         
## [20] "started excavated pebble baris walls uzel door daily big inside today rock order part find"                                
## [21] "began house collapsed ubaid barish street discovered corner removed eastern loci extension uzel earlier portion"           
## [22] "loc oven rocks today workers august plan opened cms bottom continuation articulate back push color"                        
## [23] "half eastern corner southern northern mudbrick removed western stone entire beneath pottery stones uncovered extend"       
## [24] "sounding bottom started excavated excavating half mudbrick small silty august bit bone arbitrary pottery artifacts"        
## [25] "age yrs recorded root estimate dental bag based finds complete development crown shows visible estimation"                 
## [26] "oven large ashy small black layer underneath ended context pavement ash finds days year appears"                           
## [27] "started digging pottery small part dig today day corner rest find decided top elevations workers"                          
## [28] "ash layer section ashy hard clay cut dirt color sounding began soft mudbrick brown virgin"                                 
## [29] "walls features feature located plaster meters structure inside building southeast abutting directly included eastern based"
## [30] "burial bones bone human skull burials skeleton head left fragments face exposed long animal position"                      
## [31] "part mud daily backus foundation stone today floor sounding diana corner stones structure amount loci"                     
## [32] "part dug pieces number south elevations pottery east north cleaned date plaster page digging kenantepe"                    
## [33] "rocks side rock corner pot dirt pottery middle lot piece cut white southeast line stuff"                                   
## [34] "south north east west side section removed bricks brick appears lines running edge line wide"                              
## [35] "section layer ubaid material oven excavated noted house features mud northern ashy domestic decided corner"                
## [36] "meter layer meters rocks cobbles white long feature river courses bricks number extend material features"                  
## [37] "burials burial end removed rock large identified part material human additional block exposed piece work"                  
## [38] "sample pottery carbon samples hap removed removal finds cultural pebble complete collected recovered excavation bone"      
## [39] "week weekly walls features oven rock july pottery feature age opening related iron ending slag"                            
## [40] "furnace slag debri large material small hard chamber continued finding structure chambers excavated samples piece"

And if we turn them into word clouds, we get a sense of the relative importance of words that appear in different topics.

## Loading required package: RColorBrewer

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

## NULL

##generate clusters of diaries that are similar in their distribution of topics
topic_docs <- data.frame(topic.docs)
names(topic_docs) <- documents$id

## cluster based on shared words
plot(hclust(dist(topic.words)), labels=topics.labels)

library(cluster)
topic_df_dist <- as.matrix(daisy(t(topic_docs), metric = "euclidean", stand = TRUE))
# Change row values to zero if less than row minimum plus row standard deviation
# keep only closely related documents and avoid a dense spagetti diagram
# that's difficult to interpret (hat-tip: http://stackoverflow.com/a/16047196/1036500)
topic_df_dist[ sweep(topic_df_dist, 1, (apply(topic_df_dist,1,min) + apply(topic_df_dist,1,sd) )) > 0 ] <- 0

#' Use kmeans to identify groups of similar diaries

km <- kmeans(topic_df_dist, num.topics)

# get names for each cluster
allnames <- vector("list", length = num.topics)
for(i in 1:num.topics){
  allnames[[i]] <- names(km$cluster[km$cluster == i])
}
allnames

## [[1]]
##  [1] "230"  "311"  "357"  "360"  "794"  "836"  "918"  "920"  "953"  "956" 
## [11] "960"  "961"  "1243" "1332"
## 
## [[2]]
##  [1] "153"  "164"  "169"  "297"  "299"  "312"  "366"  "396"  "457"  "513" 
## [11] "593"  "809"  "875"  "909"  "911"  "919"  "927"  "935"  "1020" "1143"
## [21] "1151" "1167" "1184" "1222" "1229" "1252" "1265" "1274" "1293" "1303"
## [31] "1304" "1436"
## 
## [[3]]
##  [1] "141" "144" "148" "151" "154" "187" "211" "232" "237" "254" "279"
## [12] "340" "347" "367" "444" "575" "583" "584" "601" "639" "804" "820"
## 
## [[4]]
##  [1] "501"  "663"  "732"  "741"  "851"  "1058" "1113" "1118" "1185" "1276"
## [11] "1337" "1350" "1378"
## 
## [[5]]
##  [1] "139" "241" "257" "258" "264" "285" "385" "408" "409" "420" "421"
## [12] "438" "439" "442" "456" "458" "481" "548" "676" "685" "711" "724"
## [23] "738" "742" "761" "762" "764" "790" "817" "877" "882" "895" "908"
## [34] "928" "951"
## 
## [[6]]
##  [1] "224"  "378"  "486"  "805"  "826"  "892"  "979"  "986"  "1099" "1124"
## [11] "1135" "1177" "1191" "1194" "1211" "1223" "1228" "1231" "1255" "1262"
## [21] "1266" "1307" "1367" "1380" "1391"
## 
## [[7]]
##  [1] "152" "185" "206" "222" "234" "250" "271" "308" "319" "332" "345"
## [12] "346" "355" "379" "390" "400" "413" "428" "467" "475" "491" "506"
## [23] "528" "532" "557" "590" "602" "637" "657" "698" "700" "721" "778"
## [34] "803" "815" "866" "962"
## 
## [[8]]
##  [1] "121"  "984"  "994"  "997"  "1021" "1033" "1065" "1082" "1094" "1161"
## [11] "1178" "1218" "1224" "1237" "1246" "1281" "1288" "1301" "1326"
## 
## [[9]]
##  [1] "199"  "358"  "373"  "401"  "403"  "416"  "479"  "531"  "541"  "582" 
## [11] "622"  "736"  "828"  "868"  "878"  "955"  "1238" "1295"
## 
## [[10]]
##  [1] "433"  "451"  "585"  "619"  "621"  "677"  "690"  "759"  "839"  "1111"
## [11] "1126" "1131" "1140" "1176" "1213" "1270" "1345" "1364" "1379"
## 
## [[11]]
##   [1] "2"    "3"    "4"    "5"    "7"    "8"    "10"   "11"   "13"   "14"  
##  [11] "16"   "17"   "18"   "20"   "21"   "22"   "23"   "25"   "26"   "27"  
##  [21] "29"   "30"   "32"   "35"   "36"   "37"   "38"   "41"   "43"   "44"  
##  [31] "45"   "46"   "47"   "48"   "49"   "51"   "52"   "53"   "54"   "55"  
##  [41] "56"   "58"   "59"   "60"   "62"   "63"   "64"   "65"   "66"   "69"  
##  [51] "71"   "72"   "73"   "74"   "75"   "77"   "79"   "80"   "81"   "82"  
##  [61] "83"   "84"   "85"   "87"   "88"   "90"   "91"   "93"   "95"   "96"  
##  [71] "98"   "99"   "100"  "103"  "104"  "105"  "108"  "109"  "111"  "113" 
##  [81] "114"  "116"  "117"  "118"  "119"  "120"  "122"  "123"  "124"  "125" 
##  [91] "127"  "128"  "132"  "133"  "135"  "138"  "146"  "149"  "150"  "155" 
## [101] "156"  "159"  "161"  "162"  "163"  "168"  "171"  "175"  "177"  "178" 
## [111] "181"  "184"  "193"  "194"  "198"  "201"  "203"  "205"  "207"  "209" 
## [121] "212"  "217"  "218"  "221"  "225"  "231"  "233"  "240"  "242"  "243" 
## [131] "244"  "247"  "251"  "256"  "267"  "269"  "272"  "274"  "275"  "276" 
## [141] "277"  "278"  "281"  "286"  "287"  "288"  "291"  "292"  "293"  "300" 
## [151] "303"  "307"  "313"  "314"  "315"  "316"  "325"  "326"  "329"  "331" 
## [161] "333"  "336"  "344"  "348"  "350"  "351"  "352"  "354"  "361"  "363" 
## [171] "386"  "391"  "392"  "402"  "406"  "410"  "412"  "417"  "422"  "425" 
## [181] "426"  "429"  "435"  "440"  "441"  "443"  "445"  "446"  "448"  "463" 
## [191] "470"  "473"  "476"  "478"  "482"  "489"  "492"  "499"  "503"  "504" 
## [201] "505"  "509"  "512"  "516"  "518"  "519"  "534"  "537"  "540"  "543" 
## [211] "544"  "545"  "547"  "549"  "550"  "551"  "552"  "553"  "556"  "559" 
## [221] "563"  "566"  "567"  "569"  "573"  "576"  "580"  "587"  "595"  "596" 
## [231] "600"  "603"  "605"  "606"  "609"  "617"  "620"  "623"  "628"  "629" 
## [241] "631"  "633"  "634"  "635"  "636"  "638"  "641"  "644"  "645"  "649" 
## [251] "653"  "654"  "659"  "664"  "665"  "668"  "673"  "674"  "681"  "686" 
## [261] "687"  "689"  "692"  "695"  "696"  "697"  "699"  "701"  "704"  "705" 
## [271] "706"  "710"  "713"  "715"  "716"  "717"  "719"  "722"  "726"  "728" 
## [281] "733"  "734"  "739"  "740"  "746"  "747"  "748"  "749"  "750"  "751" 
## [291] "758"  "760"  "766"  "767"  "768"  "771"  "772"  "773"  "775"  "776" 
## [301] "777"  "779"  "781"  "782"  "787"  "788"  "789"  "795"  "796"  "799" 
## [311] "800"  "801"  "802"  "806"  "808"  "810"  "812"  "816"  "818"  "821" 
## [321] "822"  "823"  "825"  "829"  "831"  "834"  "835"  "837"  "840"  "845" 
## [331] "848"  "849"  "854"  "858"  "859"  "861"  "862"  "864"  "871"  "879" 
## [341] "885"  "887"  "889"  "890"  "896"  "897"  "898"  "903"  "912"  "913" 
## [351] "914"  "915"  "926"  "929"  "930"  "931"  "932"  "934"  "941"  "944" 
## [361] "946"  "949"  "954"  "957"  "964"  "965"  "966"  "967"  "969"  "972" 
## [371] "974"  "977"  "978"  "980"  "981"  "982"  "983"  "987"  "988"  "989" 
## [381] "990"  "992"  "993"  "995"  "1000" "1001" "1002" "1003" "1005" "1006"
## [391] "1008" "1009" "1010" "1011" "1012" "1013" "1014" "1016" "1017" "1018"
## [401] "1019" "1023" "1024" "1025" "1027" "1028" "1029" "1034" "1035" "1038"
## [411] "1041" "1042" "1044" "1045" "1046" "1047" "1048" "1050" "1052" "1053"
## [421] "1054" "1055" "1056" "1061" "1063" "1064" "1066" "1069" "1070" "1071"
## [431] "1072" "1074" "1075" "1076" "1079" "1080" "1081" "1084" "1086" "1087"
## [441] "1088" "1089" "1090" "1091" "1092" "1093" "1098" "1101" "1104" "1109"
## [451] "1115" "1117" "1119" "1121" "1122" "1127" "1138" "1147" "1148" "1155"
## [461] "1156" "1160" "1163" "1166" "1169" "1172" "1183" "1189" "1193" "1195"
## [471] "1196" "1200" "1201" "1209" "1214" "1221" "1226" "1230" "1233" "1234"
## [481] "1247" "1254" "1257" "1258" "1259" "1263" "1264" "1268" "1271" "1275"
## [491] "1277" "1278" "1279" "1283" "1284" "1285" "1287" "1290" "1298" "1299"
## [501] "1306" "1309" "1311" "1312" "1317" "1318" "1319" "1321" "1330" "1333"
## [511] "1335" "1341" "1347" "1348" "1353" "1360" "1366" "1374" "1375" "1381"
## [521] "1382" "1383" "1385" "1386" "1387" "1388" "1389" "1396" "1397" "1398"
## [531] "1399" "1401" "1404" "1408" "1410" "1411" "1413" "1414" "1418" "1420"
## [541] "1421" "1422" "1425" "1430" "1433" "1434" "1437" "1438" "1441" "1446"
## [551] "1449" "1458" "1459"
## 
## [[12]]
##  [1] "165" "180" "186" "192" "208" "216" "223" "236" "318" "387" "399"
## [12] "411" "424" "462" "488" "514" "523" "572" "581" "610" "625" "632"
## [23] "648" "660" "691" "774" "811" "904" "940" "968" "973"
## 
## [[13]]
##  [1] "170" "249" "290" "317" "339" "359" "371" "449" "472" "483" "487"
## [12] "554" "672" "745" "765" "916" "945" "950" "963" "970"
## 
## [[14]]
##  [1] "1004" "1102" "1136" "1180" "1187" "1269" "1272" "1273" "1324" "1371"
## [11] "1395" "1455"
## 
## [[15]]
##  [1] "1390" "1392" "1394" "1400" "1402" "1403" "1405" "1406" "1407" "1409"
## [11] "1412" "1415" "1419" "1435" "1439" "1444" "1445" "1448" "1450" "1451"
## [21] "1454" "1456" "1457"
## 
## [[16]]
##  [1] "50"   "196"  "213"  "255"  "294"  "309"  "322"  "452"  "480"  "500" 
## [11] "502"  "517"  "526"  "539"  "546"  "558"  "708"  "729"  "780"  "841" 
## [21] "1150"
## 
## [[17]]
##  [1] "40"   "92"   "137"  "191"  "227"  "229"  "330"  "372"  "384"  "405" 
## [11] "436"  "454"  "468"  "495"  "555"  "655"  "786"  "814"  "853"  "894" 
## [21] "902"  "910"  "947"  "1316"
## 
## [[18]]
##  [1] "126"  "140"  "145"  "182"  "195"  "226"  "248"  "389"  "437"  "453" 
## [11] "477"  "542"  "886"  "1073" "1100" "1103"
## 
## [[19]]
##  [1] "328"  "382"  "393"  "418"  "471"  "490"  "652"  "670"  "671"  "727" 
## [11] "769"  "847"  "937"  "943"  "948"  "1157" "1188" "1205"
## 
## [[20]]
##  [1] "999"  "1015" "1022" "1031" "1037" "1039" "1049" "1057" "1062" "1077"
## [11] "1085" "1132" "1139" "1145" "1149" "1162" "1171" "1182" "1190" "1236"
## [21] "1250" "1294" "1300" "1313" "1322" "1331" "1338" "1356" "1365" "1373"
## [31] "1377"
## 
## [[21]]
##  [1] "189" "200" "219" "220" "228" "235" "245" "252" "261" "268" "280"
## [12] "284" "298" "327" "338" "341" "362" "364" "369" "375" "383" "394"
## [23] "419" "455" "459" "460" "461" "666" "678" "693" "867" "870" "880"
## [34] "888" "899" "907" "939"
## 
## [[22]]
##  [1] "160"  "188"  "215"  "253"  "263"  "282"  "296"  "324"  "335"  "353" 
## [11] "370"  "380"  "398"  "414"  "432"  "465"  "484"  "493"  "498"  "515" 
## [21] "568"  "571"  "574"  "578"  "588"  "592"  "607"  "608"  "612"  "615" 
## [31] "626"  "650"  "651"  "661"  "694"  "709"  "723"  "735"  "744"  "755" 
## [41] "813"  "824"  "852"  "857"  "876"  "900"  "1030" "1036" "1040" "1043"
## [51] "1051" "1060" "1068" "1078" "1083" "1240" "1256" "1280" "1343" "1351"
## [61] "1357" "1363" "1370"
## 
## [[23]]
##  [1] "190" "238" "259" "273" "304" "415" "434" "464" "469" "497" "508"
## [12] "533" "579" "586" "591" "597" "598" "611" "730" "753" "763" "855"
## [23] "883" "893" "924" "958" "975"
## 
## [[24]]
## [1] "560" "561" "562" "846"
## 
## [[25]]
##  [1] "6"   "9"   "12"  "15"  "19"  "24"  "28"  "34"  "39"  "42"  "57" 
## [12] "61"  "67"  "76"  "102" "106" "131"
## 
## [[26]]
##  [1] "262" "283" "427" "447" "496" "510" "525" "564" "565" "570" "577"
## [12] "627" "754" "798" "838" "872" "971"
## 
## [[27]]
##  [1] "158"  "183"  "246"  "682"  "791"  "1128" "1134" "1141" "1146" "1158"
## [11] "1175" "1206" "1212" "1227" "1245" "1327"
## 
## [[28]]
##  [1] "265"  "266"  "289"  "302"  "323"  "334"  "374"  "397"  "407"  "485" 
## [11] "520"  "527"  "669"  "819"  "933"  "936"  "1244"
## 
## [[29]]
##  [1] "143"  "147"  "167"  "197"  "204"  "214"  "260"  "521"  "630"  "646" 
## [11] "752"  "832"  "844"  "938"  "1323"
## 
## [[30]]
##  [1] "86"   "112"  "176"  "301"  "365"  "599"  "688"  "712"  "770"  "793" 
## [11] "1110" "1320"
## 
## [[31]]
##  [1] "239"  "395"  "833"  "1170" "1202" "1296" "1315" "1328" "1336" "1423"
## [11] "1440" "1442" "1447" "1452"
## 
## [[32]]
##  [1] "305" "320" "337" "349" "368" "376" "589" "594" "604" "642" "658"
## [12] "679" "737" "785" "830" "860" "881" "922"
## 
## [[33]]
##  [1] "172"  "343"  "614"  "624"  "656"  "718"  "720"  "850"  "905"  "923" 
## [11] "1007" "1106" "1116" "1123" "1130" "1144" "1154" "1159" "1165" "1174"
## [21] "1197" "1232" "1235" "1249" "1291" "1297" "1325" "1334" "1344" "1352"
## [31] "1355" "1361" "1376" "1417"
## 
## [[34]]
##  [1] "142"  "174"  "342"  "423"  "494"  "522"  "524"  "530"  "535"  "618" 
## [11] "643"  "1112" "1215" "1217" "1241" "1260" "1340" "1349" "1359" "1362"
## [21] "1384" "1393" "1416" "1424"
## 
## [[35]]
##  [1] "1097" "1105" "1108" "1120" "1125" "1137" "1142" "1152" "1164" "1173"
## [11] "1181" "1186" "1198" "1208" "1210" "1239" "1253" "1261" "1292" "1302"
## [21] "1305" "1314" "1329" "1339" "1342" "1346" "1354" "1358" "1372" "1427"
## [31] "1429" "1431"
## 
## [[36]]
##  [1] "33"  "68"  "70"  "78"  "89"  "94"  "97"  "107" "110" "115" "129"
## [12] "130" "134" "136" "157" "166" "173" "179" "202" "210" "270" "295"
## [23] "321" "381" "404" "511"
## 
## [[37]]
##  [1] "31"   "310"  "356"  "377"  "388"  "430"  "431"  "450"  "474"  "507" 
## [11] "529"  "538"  "613"  "640"  "647"  "662"  "667"  "675"  "680"  "683" 
## [21] "702"  "707"  "725"  "731"  "743"  "756"  "784"  "797"  "842"  "856" 
## [31] "863"  "869"  "873"  "1095" "1107" "1114" "1129" "1133" "1153" "1168"
## [41] "1179" "1192" "1199" "1203" "1204" "1216" "1219" "1220" "1242" "1248"
## [51] "1267" "1282" "1289" "1310" "1368"
## 
## [[38]]
##  [1] "101"  "306"  "466"  "536"  "783"  "921"  "925"  "976"  "985"  "991" 
## [11] "996"  "998"  "1026" "1032" "1059" "1067" "1096" "1286" "1369"
## 
## [[39]]
##  [1] "1207" "1225" "1251" "1308" "1426" "1428" "1432" "1443" "1453" "1460"
## 
## [[40]]
##  [1] "616" "684" "703" "714" "757" "792" "807" "827" "843" "865" "874"
## [12] "884" "891" "901" "906" "917" "942" "952" "959"

library(igraph)
g <- as.undirected(graph.adjacency(topic_df_dist))
layout1 <- layout.fruchterman.reingold(g, niter=500)
plot(g, layout=layout1, edge.curved = TRUE, vertex.size = 1, vertex.color= "grey", edge.arrow.size = 0, vertex.label.dist=0.5, vertex.label = NA)

write.graph(g, file="oc2.graphml", format="graphml")

For the Textplot output, please see my repo here. There will also be an interactive version of the network above, in that repository (along with a key for retrieving the original diaries.)

Topic Modeling Archaeological Diaries

Shawn Graham

May 15, 2015