This is a continuation of my exploration of site diaries housed at Open Context.
The 1400 diaries then that I topic model in R I am also exploring with Textplot. Texplot eventually produces a network graph; modularity of that graph finds 13 very strong groups. Accordingly, I first topic model’d for 13 topics.
I want to see how topics via topic models compare versus topics via Textplot. Upshot: the resulting model was not very useful. So I’m going back and refitting a model with more meat to it.
setwd("/Users/shawngraham/Desktop/data mining and tools/open-context/")
Then we import our data:
#Rio makes this easy. I'm assuming you know how to install packages.
library("rio")
##importing opencontext site diaries
documents <- import("diaries-in-date-order-R.csv")
Then we set up Mallet in R. We need as much memory as we’ve got, so:
options(java.parameters = "-Xmx5120m")
library(rJava)
## from http://cran.r-project.org/web/packages/mallet/mallet.pdf
library(mallet)
Now we pass the diaries to Mallet. There’s a bit of iteration here; I ran the entire topic modeling code until I got a list of the words and their frequencies (which you’ll see further below); the most frequent words are frequent by orders of magnitude, so I insert them into my stopword list so that they don’t overwhelm the analysis.
mallet.instances <- mallet.import(as.character(documents$id), as.character(documents$text), "/Users/shawngraham/Desktop/data mining and tools/TextAnalysisWithR/data/stoplist2.csv", FALSE, token.regexp="\\p{L}[\\p{L}\\p{P}]+\\p{L}")
#set the number of desired topics
num.topics <- 40
topic.model <- MalletLDA(num.topics)
## Load our documents. We could also pass in the filename of a
## saved instance list file that we build from the command-line tools.
topic.model$loadDocuments(mallet.instances)
## Get the vocabulary, and some statistics about word frequencies.
## These may be useful in further curating the stopword list.
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)
head(word.freqs)
## words term.freq doc.freq
## 1 july 133 115
## 2 working 188 156
## 3 measured 55 44
## 4 side 967 404
## 5 hill 60 43
## 6 prickly 2 2
# write.csv(word.freqs, "oc-word-freqs.csv" ) <- study this file, use it to modify your stoplist file
So let’s see what we get:
## Optimize hyperparameters every 20 iterations,
## after 50 burn-in iterations.
topic.model$setAlphaOptimization(20, 50)
## Now train a model. Note that hyperparameter optimization is on, by default.
## We can specify the number of iterations. Here we'll use a large-ish round number.
topic.model$train(1000)
## NEW: run through a few iterations where we pick the best topic for each token,
## rather than sampling from the posterior distribution.
topic.model$maximize(10)
## Get the probability of topics in documents and the probability of words in topics.
## By default, these functions return raw word counts. Here we want probabilities,
## so we normalize, and add "smoothing" so that nothing has exactly 0 probability.
doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)
# save(doc.topics, file = "ocdoctopics.RData") <-we need this for doing self-organized maps. We'll worry about these later.
Let’s see what the topic labels look like.
###from my other script; above was mimno's example script
topic.docs <- t(doc.topics)
topic.docs <- topic.docs / rowSums(topic.docs)
write.csv(topic.docs, "oc-topics-docs.csv" ) ## "C:\\Malletopic-docs.csv"
## Get a vector containing short names for the topics
topics.labels <- rep("", num.topics)
for (topic in 1:num.topics) topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=15)$words, collapse=" ")
# have a look at keywords for each topic
topics.labels
## [1] "today daily day work july loci tomorrow continue plan end dirt cleaning june taking continued"
## [2] "season excavated excavation loci year end time surfaces layers pits cut levels point final areas"
## [3] "began foundation debris dirt clay large finding potsherds floor uncovered pot spent continued lot finds"
## [4] "topsoil step top corner july opened meter meters slope located opening tepe dimensions loam loose"
## [5] "rocks side number july part started tomorrow located day find mud-brick elevations process taking floor"
## [6] "corner ash brown excavated light small clay mud compacted painter sounding july coarse opened fine"
## [7] "stones oven part cobbles cobble surfaces dug find large stone north south east digging feature"
## [8] "work mud brick loci north excavation debris southern july white daily western surfaces organic slump"
## [9] "mudbrick plaster ash cut peg base line top step section edge underneath debris half red"
## [10] "mud brick bricks material side collapse layer walls june room top floor collapsed thin burned"
## [11] "mudbrick walls plan layer section loci pits opened decided august excavate continue began ash stones"
## [12] "oven layer ash removed bricks ashy line mudbricks orange eppihimer part mudbrick inside construction platform"
## [13] "half pot dug eastern rocks lines section balk large mud august appeared southern brick sherds"
## [14] "pot small sherds large pottery stone layer debris pieces bone rim top continued ground began"
## [15] "loci elevation debris structure surfaces cms burial opened small taking cobbled continued interior lot collected"
## [16] "meters began small located baldi collected tepe meter time lower point areas good materials work"
## [17] "rocks side west compacted flat east line large pottery balk decided layer mud-brick loose corner"
## [18] "mudbrick feature grain ubaid portion structure large context appears finally contexts possibly material good collapse"
## [19] "floor loci large began clear matrix point decided balk visible pick picks appeared designated big"
## [20] "started excavated pebble baris walls uzel door daily big inside today rock order part find"
## [21] "began house collapsed ubaid barish street discovered corner removed eastern loci extension uzel earlier portion"
## [22] "loc oven rocks today workers august plan opened cms bottom continuation articulate back push color"
## [23] "half eastern corner southern northern mudbrick removed western stone entire beneath pottery stones uncovered extend"
## [24] "sounding bottom started excavated excavating half mudbrick small silty august bit bone arbitrary pottery artifacts"
## [25] "age yrs recorded root estimate dental bag based finds complete development crown shows visible estimation"
## [26] "oven large ashy small black layer underneath ended context pavement ash finds days year appears"
## [27] "started digging pottery small part dig today day corner rest find decided top elevations workers"
## [28] "ash layer section ashy hard clay cut dirt color sounding began soft mudbrick brown virgin"
## [29] "walls features feature located plaster meters structure inside building southeast abutting directly included eastern based"
## [30] "burial bones bone human skull burials skeleton head left fragments face exposed long animal position"
## [31] "part mud daily backus foundation stone today floor sounding diana corner stones structure amount loci"
## [32] "part dug pieces number south elevations pottery east north cleaned date plaster page digging kenantepe"
## [33] "rocks side rock corner pot dirt pottery middle lot piece cut white southeast line stuff"
## [34] "south north east west side section removed bricks brick appears lines running edge line wide"
## [35] "section layer ubaid material oven excavated noted house features mud northern ashy domestic decided corner"
## [36] "meter layer meters rocks cobbles white long feature river courses bricks number extend material features"
## [37] "burials burial end removed rock large identified part material human additional block exposed piece work"
## [38] "sample pottery carbon samples hap removed removal finds cultural pebble complete collected recovered excavation bone"
## [39] "week weekly walls features oven rock july pottery feature age opening related iron ending slag"
## [40] "furnace slag debri large material small hard chamber continued finding structure chambers excavated samples piece"
And if we turn them into word clouds, we get a sense of the relative importance of words that appear in different topics.
## Loading required package: RColorBrewer
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
##generate clusters of diaries that are similar in their distribution of topics
topic_docs <- data.frame(topic.docs)
names(topic_docs) <- documents$id
## cluster based on shared words
plot(hclust(dist(topic.words)), labels=topics.labels)
library(cluster)
topic_df_dist <- as.matrix(daisy(t(topic_docs), metric = "euclidean", stand = TRUE))
# Change row values to zero if less than row minimum plus row standard deviation
# keep only closely related documents and avoid a dense spagetti diagram
# that's difficult to interpret (hat-tip: http://stackoverflow.com/a/16047196/1036500)
topic_df_dist[ sweep(topic_df_dist, 1, (apply(topic_df_dist,1,min) + apply(topic_df_dist,1,sd) )) > 0 ] <- 0
#' Use kmeans to identify groups of similar diaries
km <- kmeans(topic_df_dist, num.topics)
# get names for each cluster
allnames <- vector("list", length = num.topics)
for(i in 1:num.topics){
allnames[[i]] <- names(km$cluster[km$cluster == i])
}
allnames
## [[1]]
## [1] "230" "311" "357" "360" "794" "836" "918" "920" "953" "956"
## [11] "960" "961" "1243" "1332"
##
## [[2]]
## [1] "153" "164" "169" "297" "299" "312" "366" "396" "457" "513"
## [11] "593" "809" "875" "909" "911" "919" "927" "935" "1020" "1143"
## [21] "1151" "1167" "1184" "1222" "1229" "1252" "1265" "1274" "1293" "1303"
## [31] "1304" "1436"
##
## [[3]]
## [1] "141" "144" "148" "151" "154" "187" "211" "232" "237" "254" "279"
## [12] "340" "347" "367" "444" "575" "583" "584" "601" "639" "804" "820"
##
## [[4]]
## [1] "501" "663" "732" "741" "851" "1058" "1113" "1118" "1185" "1276"
## [11] "1337" "1350" "1378"
##
## [[5]]
## [1] "139" "241" "257" "258" "264" "285" "385" "408" "409" "420" "421"
## [12] "438" "439" "442" "456" "458" "481" "548" "676" "685" "711" "724"
## [23] "738" "742" "761" "762" "764" "790" "817" "877" "882" "895" "908"
## [34] "928" "951"
##
## [[6]]
## [1] "224" "378" "486" "805" "826" "892" "979" "986" "1099" "1124"
## [11] "1135" "1177" "1191" "1194" "1211" "1223" "1228" "1231" "1255" "1262"
## [21] "1266" "1307" "1367" "1380" "1391"
##
## [[7]]
## [1] "152" "185" "206" "222" "234" "250" "271" "308" "319" "332" "345"
## [12] "346" "355" "379" "390" "400" "413" "428" "467" "475" "491" "506"
## [23] "528" "532" "557" "590" "602" "637" "657" "698" "700" "721" "778"
## [34] "803" "815" "866" "962"
##
## [[8]]
## [1] "121" "984" "994" "997" "1021" "1033" "1065" "1082" "1094" "1161"
## [11] "1178" "1218" "1224" "1237" "1246" "1281" "1288" "1301" "1326"
##
## [[9]]
## [1] "199" "358" "373" "401" "403" "416" "479" "531" "541" "582"
## [11] "622" "736" "828" "868" "878" "955" "1238" "1295"
##
## [[10]]
## [1] "433" "451" "585" "619" "621" "677" "690" "759" "839" "1111"
## [11] "1126" "1131" "1140" "1176" "1213" "1270" "1345" "1364" "1379"
##
## [[11]]
## [1] "2" "3" "4" "5" "7" "8" "10" "11" "13" "14"
## [11] "16" "17" "18" "20" "21" "22" "23" "25" "26" "27"
## [21] "29" "30" "32" "35" "36" "37" "38" "41" "43" "44"
## [31] "45" "46" "47" "48" "49" "51" "52" "53" "54" "55"
## [41] "56" "58" "59" "60" "62" "63" "64" "65" "66" "69"
## [51] "71" "72" "73" "74" "75" "77" "79" "80" "81" "82"
## [61] "83" "84" "85" "87" "88" "90" "91" "93" "95" "96"
## [71] "98" "99" "100" "103" "104" "105" "108" "109" "111" "113"
## [81] "114" "116" "117" "118" "119" "120" "122" "123" "124" "125"
## [91] "127" "128" "132" "133" "135" "138" "146" "149" "150" "155"
## [101] "156" "159" "161" "162" "163" "168" "171" "175" "177" "178"
## [111] "181" "184" "193" "194" "198" "201" "203" "205" "207" "209"
## [121] "212" "217" "218" "221" "225" "231" "233" "240" "242" "243"
## [131] "244" "247" "251" "256" "267" "269" "272" "274" "275" "276"
## [141] "277" "278" "281" "286" "287" "288" "291" "292" "293" "300"
## [151] "303" "307" "313" "314" "315" "316" "325" "326" "329" "331"
## [161] "333" "336" "344" "348" "350" "351" "352" "354" "361" "363"
## [171] "386" "391" "392" "402" "406" "410" "412" "417" "422" "425"
## [181] "426" "429" "435" "440" "441" "443" "445" "446" "448" "463"
## [191] "470" "473" "476" "478" "482" "489" "492" "499" "503" "504"
## [201] "505" "509" "512" "516" "518" "519" "534" "537" "540" "543"
## [211] "544" "545" "547" "549" "550" "551" "552" "553" "556" "559"
## [221] "563" "566" "567" "569" "573" "576" "580" "587" "595" "596"
## [231] "600" "603" "605" "606" "609" "617" "620" "623" "628" "629"
## [241] "631" "633" "634" "635" "636" "638" "641" "644" "645" "649"
## [251] "653" "654" "659" "664" "665" "668" "673" "674" "681" "686"
## [261] "687" "689" "692" "695" "696" "697" "699" "701" "704" "705"
## [271] "706" "710" "713" "715" "716" "717" "719" "722" "726" "728"
## [281] "733" "734" "739" "740" "746" "747" "748" "749" "750" "751"
## [291] "758" "760" "766" "767" "768" "771" "772" "773" "775" "776"
## [301] "777" "779" "781" "782" "787" "788" "789" "795" "796" "799"
## [311] "800" "801" "802" "806" "808" "810" "812" "816" "818" "821"
## [321] "822" "823" "825" "829" "831" "834" "835" "837" "840" "845"
## [331] "848" "849" "854" "858" "859" "861" "862" "864" "871" "879"
## [341] "885" "887" "889" "890" "896" "897" "898" "903" "912" "913"
## [351] "914" "915" "926" "929" "930" "931" "932" "934" "941" "944"
## [361] "946" "949" "954" "957" "964" "965" "966" "967" "969" "972"
## [371] "974" "977" "978" "980" "981" "982" "983" "987" "988" "989"
## [381] "990" "992" "993" "995" "1000" "1001" "1002" "1003" "1005" "1006"
## [391] "1008" "1009" "1010" "1011" "1012" "1013" "1014" "1016" "1017" "1018"
## [401] "1019" "1023" "1024" "1025" "1027" "1028" "1029" "1034" "1035" "1038"
## [411] "1041" "1042" "1044" "1045" "1046" "1047" "1048" "1050" "1052" "1053"
## [421] "1054" "1055" "1056" "1061" "1063" "1064" "1066" "1069" "1070" "1071"
## [431] "1072" "1074" "1075" "1076" "1079" "1080" "1081" "1084" "1086" "1087"
## [441] "1088" "1089" "1090" "1091" "1092" "1093" "1098" "1101" "1104" "1109"
## [451] "1115" "1117" "1119" "1121" "1122" "1127" "1138" "1147" "1148" "1155"
## [461] "1156" "1160" "1163" "1166" "1169" "1172" "1183" "1189" "1193" "1195"
## [471] "1196" "1200" "1201" "1209" "1214" "1221" "1226" "1230" "1233" "1234"
## [481] "1247" "1254" "1257" "1258" "1259" "1263" "1264" "1268" "1271" "1275"
## [491] "1277" "1278" "1279" "1283" "1284" "1285" "1287" "1290" "1298" "1299"
## [501] "1306" "1309" "1311" "1312" "1317" "1318" "1319" "1321" "1330" "1333"
## [511] "1335" "1341" "1347" "1348" "1353" "1360" "1366" "1374" "1375" "1381"
## [521] "1382" "1383" "1385" "1386" "1387" "1388" "1389" "1396" "1397" "1398"
## [531] "1399" "1401" "1404" "1408" "1410" "1411" "1413" "1414" "1418" "1420"
## [541] "1421" "1422" "1425" "1430" "1433" "1434" "1437" "1438" "1441" "1446"
## [551] "1449" "1458" "1459"
##
## [[12]]
## [1] "165" "180" "186" "192" "208" "216" "223" "236" "318" "387" "399"
## [12] "411" "424" "462" "488" "514" "523" "572" "581" "610" "625" "632"
## [23] "648" "660" "691" "774" "811" "904" "940" "968" "973"
##
## [[13]]
## [1] "170" "249" "290" "317" "339" "359" "371" "449" "472" "483" "487"
## [12] "554" "672" "745" "765" "916" "945" "950" "963" "970"
##
## [[14]]
## [1] "1004" "1102" "1136" "1180" "1187" "1269" "1272" "1273" "1324" "1371"
## [11] "1395" "1455"
##
## [[15]]
## [1] "1390" "1392" "1394" "1400" "1402" "1403" "1405" "1406" "1407" "1409"
## [11] "1412" "1415" "1419" "1435" "1439" "1444" "1445" "1448" "1450" "1451"
## [21] "1454" "1456" "1457"
##
## [[16]]
## [1] "50" "196" "213" "255" "294" "309" "322" "452" "480" "500"
## [11] "502" "517" "526" "539" "546" "558" "708" "729" "780" "841"
## [21] "1150"
##
## [[17]]
## [1] "40" "92" "137" "191" "227" "229" "330" "372" "384" "405"
## [11] "436" "454" "468" "495" "555" "655" "786" "814" "853" "894"
## [21] "902" "910" "947" "1316"
##
## [[18]]
## [1] "126" "140" "145" "182" "195" "226" "248" "389" "437" "453"
## [11] "477" "542" "886" "1073" "1100" "1103"
##
## [[19]]
## [1] "328" "382" "393" "418" "471" "490" "652" "670" "671" "727"
## [11] "769" "847" "937" "943" "948" "1157" "1188" "1205"
##
## [[20]]
## [1] "999" "1015" "1022" "1031" "1037" "1039" "1049" "1057" "1062" "1077"
## [11] "1085" "1132" "1139" "1145" "1149" "1162" "1171" "1182" "1190" "1236"
## [21] "1250" "1294" "1300" "1313" "1322" "1331" "1338" "1356" "1365" "1373"
## [31] "1377"
##
## [[21]]
## [1] "189" "200" "219" "220" "228" "235" "245" "252" "261" "268" "280"
## [12] "284" "298" "327" "338" "341" "362" "364" "369" "375" "383" "394"
## [23] "419" "455" "459" "460" "461" "666" "678" "693" "867" "870" "880"
## [34] "888" "899" "907" "939"
##
## [[22]]
## [1] "160" "188" "215" "253" "263" "282" "296" "324" "335" "353"
## [11] "370" "380" "398" "414" "432" "465" "484" "493" "498" "515"
## [21] "568" "571" "574" "578" "588" "592" "607" "608" "612" "615"
## [31] "626" "650" "651" "661" "694" "709" "723" "735" "744" "755"
## [41] "813" "824" "852" "857" "876" "900" "1030" "1036" "1040" "1043"
## [51] "1051" "1060" "1068" "1078" "1083" "1240" "1256" "1280" "1343" "1351"
## [61] "1357" "1363" "1370"
##
## [[23]]
## [1] "190" "238" "259" "273" "304" "415" "434" "464" "469" "497" "508"
## [12] "533" "579" "586" "591" "597" "598" "611" "730" "753" "763" "855"
## [23] "883" "893" "924" "958" "975"
##
## [[24]]
## [1] "560" "561" "562" "846"
##
## [[25]]
## [1] "6" "9" "12" "15" "19" "24" "28" "34" "39" "42" "57"
## [12] "61" "67" "76" "102" "106" "131"
##
## [[26]]
## [1] "262" "283" "427" "447" "496" "510" "525" "564" "565" "570" "577"
## [12] "627" "754" "798" "838" "872" "971"
##
## [[27]]
## [1] "158" "183" "246" "682" "791" "1128" "1134" "1141" "1146" "1158"
## [11] "1175" "1206" "1212" "1227" "1245" "1327"
##
## [[28]]
## [1] "265" "266" "289" "302" "323" "334" "374" "397" "407" "485"
## [11] "520" "527" "669" "819" "933" "936" "1244"
##
## [[29]]
## [1] "143" "147" "167" "197" "204" "214" "260" "521" "630" "646"
## [11] "752" "832" "844" "938" "1323"
##
## [[30]]
## [1] "86" "112" "176" "301" "365" "599" "688" "712" "770" "793"
## [11] "1110" "1320"
##
## [[31]]
## [1] "239" "395" "833" "1170" "1202" "1296" "1315" "1328" "1336" "1423"
## [11] "1440" "1442" "1447" "1452"
##
## [[32]]
## [1] "305" "320" "337" "349" "368" "376" "589" "594" "604" "642" "658"
## [12] "679" "737" "785" "830" "860" "881" "922"
##
## [[33]]
## [1] "172" "343" "614" "624" "656" "718" "720" "850" "905" "923"
## [11] "1007" "1106" "1116" "1123" "1130" "1144" "1154" "1159" "1165" "1174"
## [21] "1197" "1232" "1235" "1249" "1291" "1297" "1325" "1334" "1344" "1352"
## [31] "1355" "1361" "1376" "1417"
##
## [[34]]
## [1] "142" "174" "342" "423" "494" "522" "524" "530" "535" "618"
## [11] "643" "1112" "1215" "1217" "1241" "1260" "1340" "1349" "1359" "1362"
## [21] "1384" "1393" "1416" "1424"
##
## [[35]]
## [1] "1097" "1105" "1108" "1120" "1125" "1137" "1142" "1152" "1164" "1173"
## [11] "1181" "1186" "1198" "1208" "1210" "1239" "1253" "1261" "1292" "1302"
## [21] "1305" "1314" "1329" "1339" "1342" "1346" "1354" "1358" "1372" "1427"
## [31] "1429" "1431"
##
## [[36]]
## [1] "33" "68" "70" "78" "89" "94" "97" "107" "110" "115" "129"
## [12] "130" "134" "136" "157" "166" "173" "179" "202" "210" "270" "295"
## [23] "321" "381" "404" "511"
##
## [[37]]
## [1] "31" "310" "356" "377" "388" "430" "431" "450" "474" "507"
## [11] "529" "538" "613" "640" "647" "662" "667" "675" "680" "683"
## [21] "702" "707" "725" "731" "743" "756" "784" "797" "842" "856"
## [31] "863" "869" "873" "1095" "1107" "1114" "1129" "1133" "1153" "1168"
## [41] "1179" "1192" "1199" "1203" "1204" "1216" "1219" "1220" "1242" "1248"
## [51] "1267" "1282" "1289" "1310" "1368"
##
## [[38]]
## [1] "101" "306" "466" "536" "783" "921" "925" "976" "985" "991"
## [11] "996" "998" "1026" "1032" "1059" "1067" "1096" "1286" "1369"
##
## [[39]]
## [1] "1207" "1225" "1251" "1308" "1426" "1428" "1432" "1443" "1453" "1460"
##
## [[40]]
## [1] "616" "684" "703" "714" "757" "792" "807" "827" "843" "865" "874"
## [12] "884" "891" "901" "906" "917" "942" "952" "959"
library(igraph)
g <- as.undirected(graph.adjacency(topic_df_dist))
layout1 <- layout.fruchterman.reingold(g, niter=500)
plot(g, layout=layout1, edge.curved = TRUE, vertex.size = 1, vertex.color= "grey", edge.arrow.size = 0, vertex.label.dist=0.5, vertex.label = NA)
write.graph(g, file="oc2.graphml", format="graphml")
For the Textplot output, please see my repo here. There will also be an interactive version of the network above, in that repository (along with a key for retrieving the original diaries.)