library(tidyverse)
library(DT)
library(tidytext)        
library(readxl)         
  1. Read in the text and unnest the words.
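# Load the manifestos (the code below relies on `author` and `text` columns)
manifestos <- read_excel("manifestos.xlsx")

# Split each manifesto into one word per row
manifesto_words <- manifestos %>% 
  unnest_tokens(word, text)

manifesto_words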

Here are the words from the manifesto file, split into one row per word for each manifesto.
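To make the tokenizing step concrete, here is a minimal sketch on a made-up sentence. The `toy` data frame and its contents are hypothetical and not part of the manifesto data. With its default settings, `unnest_tokens()` lowercases the text and drops punctuation.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row data set with the same column names as `manifestos`
toy <- tibble(author = "X", text = "The end. The END!")

# One row per word; lowercased, punctuation removed
toy_tokens <- toy %>% 
  unnest_tokens(word, text)

toy_tokens
```

Because "The" and "END!" are normalized to "the" and "end", later counts treat them as the same word.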

  2. Generate a table that includes both lexical diversity and density, and the total number of words, of each document.
manifesto_words %>% 
  group_by(author) %>% 
  summarize(num_words = n(), lex_diversity = n_distinct(word), 
            lex_density = n_distinct(word)/n())

This table shows, for each manifesto, the total number of words, the lexical diversity (the number of distinct words), and the lexical density (the number of distinct words divided by the total number of words). Rodger wrote the longest manifesto, with over 6,000 words, and his lexical diversity was fairly high as well. HarperMercer had the highest lexical density of the group, indicating that this manifesto repeated words less often relative to its length (1,589 words).
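As a small worked example of these two measures, the sketch below applies the same `summarize()` call to made-up data. The `toy_words` table and its authors "A" and "B" are hypothetical.

```r
library(dplyr)

# Hypothetical tokens: author A uses 3 distinct words out of 4,
# author B uses 2 distinct words out of 4
toy_words <- tibble(
  author = c("A", "A", "A", "A", "B", "B", "B", "B"),
  word   = c("the", "cat", "the", "dog", "we", "we", "we", "go")
)

toy_summary <- toy_words %>% 
  group_by(author) %>% 
  summarize(num_words = n(),
            lex_diversity = n_distinct(word),
            lex_density = n_distinct(word) / n())

toy_summary
```

Both toy authors wrote four words, but A has a density of 3/4 and B a density of 2/4, because B repeats "we". Density as defined here tends to fall as a text gets longer, so it is most comparable between documents of similar length.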

  3. Generate a table with the mean word length of each document.
manifesto_words %>% 
  group_by(author) %>% 
  mutate(word_length = nchar(word)) %>% 
  summarize(mean_word_length = mean(word_length)) %>% 
  arrange(-mean_word_length)

Here is a table of the mean word length in each manifesto, sorted from longest to shortest. All the documents were similar, but Breivik had the longest words on average, at 5.15 characters per word.

  4. Generate a graph with mini histograms of each document’s word lengths.
manifesto_words %>% 
  mutate(word_length = nchar(word)) %>% 
  ggplot(aes(word_length)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(vars(author), scales = "free_y")+
  labs(title = "Word Length of Manifestos by Document")


Here is a set of histograms showing the distribution of word lengths in each manifesto. All have a very similar shape, with each author mainly using words fewer than 10 characters long. However, the counts on the y-axis differ considerably: Rodger, who wrote the longest manifesto, has bars reaching over 1,200 words, compared with just over 300 for Auvinen.

  5. Remove stop words and then create a graph with the most common words in each document.
manifesto_words %>% 
  anti_join(stop_words, by = "word") %>%   # drop common function words
  count(author, word, sort = TRUE) %>% 
  group_by(author) %>% 
  slice_max(n, n = 5) %>%                  # top 5 words per author (ties kept)
  ungroup() %>%
  # order bars within each facet rather than across all authors
  mutate(word = reorder_within(word, n, author)) %>% 
  ggplot(aes(word, n, fill = author)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  labs(x = NULL, y = "Count", title = "Most Common Words in Each Manifesto") +
  facet_wrap(vars(author), scales = "free") +
  scale_fill_viridis_d() +
  theme_minimal() +
  coord_flip()

Here is a graph of the most commonly used words in each manifesto, with the stop words removed. The graph begins to show what is distinctive about each author, yet similar themes run through all of them, including politics, religion, and society in general.
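The stop-word step can be illustrated on a few made-up tokens. This is only a sketch: the `toy_tokens` table is hypothetical, while `stop_words` is the data set shipped with tidytext. `anti_join()` keeps only the rows whose `word` has no match in `stop_words`.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens mixing function words and content words
toy_tokens <- tibble(
  author = "X",
  word   = c("the", "revolution", "of", "the", "revolution", "manifesto")
)

# Remove stop words, then count what is left
toy_counts <- toy_tokens %>% 
  anti_join(stop_words, by = "word") %>% 
  count(author, word, sort = TRUE)

toy_counts
```

Passing `by = "word"` explicitly avoids the "Joining, by" message and makes the join key visible to the reader.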

  6. Calculate tf-idfs and create a graph of the words with the highest tf-idfs in each document.
# Count how often each author uses each word
manifesto_word_counts <- manifestos %>% 
  unnest_tokens(word, text) %>% 
  count(author, word, sort = TRUE)

# Total number of words per author
total_words <- manifesto_word_counts %>% 
  group_by(author) %>% 
  summarize(total = sum(n)) 

manifesto_word_counts <- left_join(manifesto_word_counts, total_words, by = "author")

# Add tf, idf, and tf_idf columns, treating each author as one document
manifesto_tf_idf <- manifesto_word_counts %>% 
  bind_tf_idf(word, author, n)

manifesto_tf_idf %>% 
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(author) %>% 
  slice_max(tf_idf, n = 5) %>%   # top 5 tf-idf words per author (ties kept)
  ggplot(aes(word, tf_idf, fill = author)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf", title = "Highest tf-idf Words in Each Manifesto") +
  facet_wrap(~author, scales = "free") +
  coord_flip() +
  theme_minimal() +
  scale_fill_viridis_d()

Here is a graph showing the words with the highest tf-idf in each manifesto. A word's tf-idf is high when it is frequent in one document but rare in the others, so the top words are those most characteristic of each manifesto relative to the rest. Some authors had one word that stood far above the others (Cho, with “you” well ahead of “jesus” or “wanna”), while others had a more even spread (HarperMercer, whose top word, “i”, scored close to the other words he used).
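To show how the score is built, the sketch below computes tf-idf on made-up counts two ways: with `bind_tf_idf()` and by hand. The `toy_counts` table is hypothetical. The manual version uses term frequency (a word's count divided by the author's total) times inverse document frequency (the natural log of the number of documents divided by the number of documents containing the word), which is the definition given in the tidytext documentation.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts for two "documents" (authors A and B)
toy_counts <- tibble(
  author = c("A", "A", "B", "B"),
  word   = c("freedom", "the", "the", "peace"),
  n      = c(2, 2, 3, 1)
)

# tf-idf as computed by tidytext
toy_tf_idf <- toy_counts %>% 
  bind_tf_idf(word, author, n)

# The same quantity computed by hand
n_docs <- n_distinct(toy_counts$author)

manual <- toy_counts %>% 
  group_by(author) %>% 
  mutate(tf = n / sum(n)) %>%                       # share of the author's words
  group_by(word) %>% 
  mutate(idf = log(n_docs / n_distinct(author))) %>% # rarity across documents
  ungroup() %>% 
  mutate(tf_idf = tf * idf)

toy_tf_idf
manual
```

"the" appears in both toy documents, so its idf is log(2/2) = 0 and its tf-idf is 0 no matter how often it is used. "freedom" appears only in A, so it gets a positive score there. A word shared by every document can only reach the top of a tf-idf chart through ties at zero.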
