This is a report for my course project in natural language processing and text mining, where I perform analytics on a selected corpus of texts. For my corpus, I scraped sequences of blog posts from LessWrong, where Eliezer Yudkowsky, Scott Alexander, and others write about artificial intelligence, behavioral economics, rationality, neuroscience, and human decision making from theoretical perspectives. Yudkowsky is well known for his ideas about friendly AI and founded the Machine Intelligence Research Institute. For this exploratory analysis I focus on the Argument and Analysis sequence by Scott Alexander.
Key points / links:
For scraping the sequences of essays I used the rvest library (a sketch of this step follows the list below). For basic text processing and cleaning I used the tm, tidyverse, and tidytext libraries. These were helpful, as the %>% operator provides a clean and easy-to-interpret syntax for the processing pipeline.
For visualizations I used the popular ggplot2 library, which is part of the core tidyverse (though ggplot2 composes layers with + rather than %>%). Additionally, I used the ggraph and igraph libraries to plot word and bigram correlations.
For vectorizing text and creating LDA models, I used the ldatuning, text2vec, and topicmodels libraries.
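As a concrete starting point, here is a minimal sketch of the scraping and tokenizing steps. The URL is a placeholder and the CSS selectors are assumptions, not the exact ones used for this report:

```r
library(rvest)
library(dplyr)
library(tidytext)

# Placeholder for the LessWrong sequence page (not the real id)
sequence_url <- "https://www.lesswrong.com/s/..."

# Collect links to the individual essays in the sequence;
# the "a" selector is an assumption and would need narrowing in practice
essay_urls <- read_html(sequence_url) %>%
  html_elements("a") %>%
  html_attr("href")

# Pull the body text of one essay and tokenize it into words
essay_words <- read_html(essay_urls[1]) %>%
  html_elements("p") %>%
  html_text2() %>%
  tibble(text = .) %>%
  unnest_tokens(word, text)
```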
Essays in Argument and Analysis Sequence
Right away, some exploratory analysis of word counts makes it clear that the sequence of essays is focused on argumentation and discourse. The plot below shows word frequencies across the essays in the Argument and Analysis sequence. Many words, such as contrarian, position, argument, students, law, and true, point to rationality in discourse. There also appear to be some broader themes, such as human behavior, decision making, and cognition.
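A sketch of how such a frequency plot can be built with tidytext and ggplot2, assuming the essays sit in a data frame essays with columns url and text (the names are assumptions):

```r
library(dplyr)
library(tidytext)
library(ggplot2)

essays %>%
  unnest_tokens(word, text) %>%          # one row per word
  anti_join(stop_words, by = "word") %>% # drop common stop words
  count(word, sort = TRUE) %>%
  slice_max(n, n = 20) %>%
  ggplot(aes(n, reorder(word, n))) +
  geom_col() +
  labs(x = "word count", y = NULL)
```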
Similarly, the top negative tokens share a general theme: words that point toward discourse on complex, foggy topics. False, wrong, excuse, bias, defect, and denial suggest examples of poorly defended positions in arguments across the essays.
The plots above contain a few interesting points that illustrate how important context is when reading content. In the absence of surrounding words, it is impossible to determine meaning beyond the definition of the individual word itself. For example, bias is recorded as a token with negative sentiment, but the lexicon does not account for surrounding words that indicate whether bias leans one way or another; a passage describing the absence of bias could well be positive. Free and worth in the positive-sentiments plot fall into a similar bin. Combined with other words, free can convey either positive or negative sentiment: free of harm is positive, while free time could be framed as time spent or time lost, both of which carry negative sentiment. This could be said of all words, but it is important to recognize how much context matters and how lexicon-based sentiment analysis can be somewhat misleading at times.
The positive bigrams above lead with pretty, and there the second word is mapped to a sentiment we could accept as correct. The negative bigrams below, however, are misleading: when a bigram leads with a negation such as never, no, not, or without, the second word is still scored in isolation. We would read a bigram such as never liked as negative, but only the second term is recognized for its sentiment, so across the negative bigrams the true sentiment is frequently the reverse of the scored one.
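One way to surface this negation problem is to split each bigram, keep those led by a negation word, and inspect the sentiment assigned to the second word alone. A minimal sketch, assuming a data frame bigrams with a bigram column:

```r
library(dplyr)
library(tidyr)
library(tidytext)

negation_words <- c("never", "no", "not", "without")

bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% negation_words) %>%
  # the lexicon only sees word2, so "never liked" scores as positive
  inner_join(get_sentiments("bing"), by = c("word2" = "word")) %>%
  count(word1, word2, sentiment, sort = TRUE)
```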
Top words by URL
## [1] "cardiologist" "confidentiality" "goat" "pig"
## [5] "house" "proving" "doctor" "wolf"
## [9] "experiences" "heraclitus" "troll" "proves"
## [13] "argument" "audience" "bang" "rigor"
## [17] "rape" "black" "cardiologists" "gandhi"
## [21] "violating" "cardiology" "cops" "bazooka"
## [25] "bridge"
Top tokens
## [1] "clumsy" "gameplayer" "partner" "playing" "iterated"
## [6] "prisoners" "dilemma" "publicly" "precommitted" "titfortat"
## [11] "strategy" "iteration" "5" "youre" "happily"
## [16] "raking" "bonuses" "cooperation" "partner" "unexpectedly"
## [21] "presses" "defect" "button" "uh" "partner"
Top bigrams
## # A tibble: 15,289 x 2
## bigram n
## <chr> <int>
## 1 slippery slope 15
## 2 slippery slopes 15
## 3 schelling point 14
## 4 brute strength 13
## 5 holocaust denial 12
## 6 medical confidentiality 12
## 7 conspicuously signal 8
## 8 ethics ethics 8
## 9 exist preference 8
## 10 general principle 8
## # ... with 15,279 more rows
Top bigrams by tf-idf
## [1] "medical confidentiality" "york times"
## [3] "argument proves" "built house"
## [5] "reality doesnt" "creates expectation"
## [7] "violating medical" "virtue points"
## [9] "argument disproven" "conception policeman"
## [11] "disprove existence" "enshrine general"
## [13] "father leading" "general prevent"
## [15] "golden rule" "imagine mother"
## [17] "impossible disprove" "leading conception"
## [19] "mother raped" "opponent minutes"
## [21] "policeman prevented" "policemen general"
## [23] "prevent rape" "prevented rape"
## [25] "principle policemen"
The bigrams above, ranked by the term frequency-inverse document frequency (tf-idf) matrix, provide some more context on the essays in the sequence. In a sequence focused on rationality, argumentation, and analysis, it makes sense to see some more contentious topics. In particular, argument proves, argument disproven, disprove existence, and creates expectation all point toward rational approaches to discourse.
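A sketch of the tf-idf ranking itself, assuming a data frame bigram_counts with columns url, bigram, and n (counts of each bigram per essay):

```r
library(dplyr)
library(tidytext)

bigram_tf_idf <- bigram_counts %>%
  bind_tf_idf(bigram, url, n) %>%  # term, document, count
  arrange(desc(tf_idf))
```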
Bigram relationships
The plot above shows frequently occurring bigrams, that is, pairs of words that frequently co-occur. Many seem to be what we could describe as typical bigrams, ones that occur across text in many domains: medical confidentiality, slippery slope, brute strength, global warming, makes sense, and minimum wage. The frequent occurrence of accepting excuse fits the argument-and-analysis theme across the essays.
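The network plot can be drawn with igraph and ggraph along these lines; the frequency cutoff of 5 is an arbitrary illustrative choice:

```r
library(dplyr)
library(tidyr)
library(igraph)
library(ggraph)

bigram_graph <- bigram_counts %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(n > 5) %>%               # keep only frequent pairs
  graph_from_data_frame()

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n)) +  # darker edges for frequent pairs
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE)
```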
Some example bigram correlations
Strong bigrams
## # A tibble: 322 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 strong human 1.
## 2 strong care 0.802
## 3 strong wrong 0.802
## 4 strong hand 0.802
## 5 strong ive 0.802
## 6 strong coming 0.802
## 7 strong give 0.802
## 8 strong possibly 0.802
## 9 strong support 0.802
## 10 strong back 0.802
## # ... with 312 more rows
Pretty bigrams
## # A tibble: 322 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 pretty position 1.
## 2 pretty wrong 0.802
## 3 pretty work 0.802
## 4 pretty give 0.802
## 5 pretty possibly 0.802
## 6 pretty support 0.802
## 7 pretty principle 0.802
## 8 pretty youre 0.655
## 9 pretty class 0.655
## 10 pretty break 0.655
## # ... with 312 more rows
Philosophical bigrams
## # A tibble: 322 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 philosophical based 0.816
## 2 philosophical exist 0.802
## 3 philosophical contrarian 0.802
## 4 philosophical argument 0.667
## 5 philosophical principle 0.667
## 6 philosophical virtue 0.667
## 7 philosophical slippery 0.612
## 8 philosophical slopes 0.612
## 9 philosophical outcomes 0.612
## 10 philosophical liberal 0.612
## # ... with 312 more rows
Signal bigrams
## # A tibble: 322 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 signal signaling 1
## 2 signal warming 1
## 3 signal uneducated 1
## 4 signal metacontrarian 1
## 5 signal triad 1
## 6 signal meta 1
## 7 signal defect 0.667
## 8 signal costs 0.667
## 9 signal quality 0.667
## 10 signal academic 0.667
## # ... with 312 more rows
Position, principle, religion, science
Pairwise correlations of word relationships
Correlations greater than 0.85
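These pairwise correlations can be computed with the widyr library (an assumption; the report's exact code isn't shown). The sketch assumes a data frame essay_words with one row per word per section:

```r
library(dplyr)
library(widyr)

word_cors <- essay_words %>%
  group_by(word) %>%
  filter(n() >= 10) %>%  # keep reasonably frequent words
  ungroup() %>%
  pairwise_cor(word, section, sort = TRUE)

# the subset plotted above
word_cors %>% filter(correlation > 0.85)
```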
Identifying the ideal number of topics
## fit models... done.
## calculate metrics:
## Griffiths2004... done.
## CaoJuan2009... done.
## Arun2010... done.
## Devaud2014... unknown!
The ideal number of topics appears to be 12, where the Griffiths2004, CaoJuan2009, and Arun2010 metrics converge.
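A sketch of the metric search with ldatuning, assuming dtm is the document-term matrix. Note that the correct metric name is "Deveaud2014"; a misspelling such as "Devaud2014" produces the "unknown!" line in the log above:

```r
library(ldatuning)

result <- FindTopicsNumber(
  dtm,
  topics  = seq(2, 20, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 42)
)

FindTopicsNumber_plot(result)
```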
Key words across the topic clusters
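A minimal sketch of fitting the final 12-topic model and extracting each topic's key words; k = 12 follows the metrics above, while the rest is illustrative rather than the report's exact code:

```r
library(topicmodels)
library(tidytext)
library(dplyr)

lda_fit <- LDA(dtm, k = 12, control = list(seed = 42))

top_terms <- tidy(lda_fit, matrix = "beta") %>%  # per-topic word weights
  group_by(topic) %>%
  slice_max(beta, n = 8) %>%
  ungroup()
```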
Model loss at each training epoch
## INFO [19:18:57.761] epoch 1, loss 0.1361
## INFO [19:18:57.875] epoch 2, loss 0.0730
## INFO [19:18:57.908] epoch 3, loss 0.0548
## INFO [19:18:57.943] epoch 4, loss 0.0443
## INFO [19:18:57.971] epoch 5, loss 0.0371
## INFO [19:18:57.993] epoch 6, loss 0.0318
## INFO [19:18:58.029] epoch 7, loss 0.0276
## INFO [19:18:58.054] epoch 8, loss 0.0243
## INFO [19:18:58.076] epoch 9, loss 0.0215
## INFO [19:18:58.111] epoch 10, loss 0.0193
Model dimensionality
## [1] 863 50
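The embedding training can be sketched with text2vec's GloVe implementation. The dimensionality (50) and epoch count (10) mirror the output above; the remaining parameters and the essay_text input (a character vector of essay texts) are assumptions:

```r
library(text2vec)

tokens <- word_tokenizer(tolower(essay_text))
it <- itoken(tokens)

# build the vocabulary and a term co-occurrence matrix
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 3)
tcm <- create_tcm(it, vocab_vectorizer(vocab), skip_grams_window = 5)

# fit 50-dimensional GloVe vectors for 10 epochs
glove <- GlobalVectors$new(rank = 50, x_max = 10)
word_vectors <- glove$fit_transform(tcm, n_iter = 10)

dim(word_vectors)  # e.g. 863 x 50, matching the output above
```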
Some sample word context measures
We can think of word embeddings as vectorized representations of text: each word is mapped to a numeric vector learned from its co-occurrences with other words. A word's vector encodes the weighted context in which it appears across the corpus, so words that show up in similar contexts, whether the window is a sentence or a paragraph, end up with similar vectors. The geometry of this vector space is what lets us measure context numerically, as in the similarity scores below.
## smarter clumsy military mind metacontrarians
## 1.0000000 0.4355702 0.3772758 0.3513603 0.3427298
## smarter economy outcomes fits system
## 0.4606388 0.4420414 0.4173746 0.4147477 0.3958409
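The context measures above are cosine similarities between one word's vector and all the others, which text2vec exposes through sim2(). A sketch of the query for smarter, reusing the word_vectors matrix trained above:

```r
library(text2vec)

similarities <- sim2(
  x = word_vectors,
  y = word_vectors["smarter", , drop = FALSE],
  method = "cosine"
)

head(sort(similarities[, 1], decreasing = TRUE), 5)
```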
I read LessWrong to keep in touch with rationality and philosophy as artificial intelligence and technology continue to advance at a rapid rate. I really enjoyed some economics courses as an undergraduate, which briefly introduced me to the economics of human behavior, decision science, and behavioral economics. I also wanted to challenge myself by doing analytics on ambiguous text: text that contains many possible meanings, extends metaphorically across different examples, and applies in its own unique way from domain to domain.
As with the 621 project, it was interesting to see some of the more challenging aspects of mining text and extracting meaning. I am interested in extracting logic from passages of text, especially passages containing meta-laws or layered, rule-based logic. I can see how this would go beyond the scope of a one-semester course and extend into a long-term research project. As for the project analysis itself, it could be extended to many different bodies of text as a way to identify key characteristics of passages.
I learned that sentiment analysis is actually somewhat high-risk: it cannot be relied on as a stand-alone form of text analysis, but rather needs to be augmented with other methods. An individual word without context could be read in almost any way; the surrounding group or sub-group of words provides the meaning and lends a form of directionality to the word being observed.