library(tidyverse)
library(DT)
library(tidytext) # package for text analysis
library(readxl) # reads excel files, the format I used for the data
suicide_notes <- read_excel("suicide_notes.xlsx")
suicide_notes
Notice that when you first read these documents, they’re just a long string of text. We need to separate the words so they can be analyzed.
The tidytext command unnest_tokens() will separate (or unnest) the words, so that there is one word per row. This command will also remove all punctuation and make everything lower case. In unnest_tokens(word, text), ‘text’ is the name of the original column with the text in it, and ‘word’ is what we want the new column to be named.
suicide_words <- suicide_notes %>%
unnest_tokens(word, text)
suicide_words
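To see what unnest_tokens() does in isolation, here is a minimal sketch on a made-up two-row data frame. The name toy_notes and its contents are hypothetical stand-ins for suicide_notes, not part of the real data.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stand-in for suicide_notes: one row per document
toy_notes <- tibble(
  author = c("A", "B"),
  text   = c("Goodbye, my Dear friends.", "I am SO sorry!")
)

# First argument: name of the new column; second: the column holding the text
toy_words <- toy_notes %>%
  unnest_tokens(word, text)

toy_words  # one lower-case word per row, punctuation stripped
```

The author column is carried along, so every word stays labelled with the document it came from.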
The number of words used in a text can be calculated with n(), which counts the rows in each group. Because there is now one word per row, the row count is the word count.
But many of the words we use are repetitions: Words like ‘the’ and ‘a’ are used over and over. So the number of distinct words is less than the total number of words. We can calculate the number of distinct words with n_distinct().
Both of these measures are calculated below:
suicide_words %>%
group_by(author) %>%
summarize(num_words = n(), lex_diversity = n_distinct(word))
The number of distinct words is a measure of lexical diversity. It is one measure of an individual’s vocabulary.
Measures like this have been used to determine authorship (e.g., in cases where a suicide note is suspected of being fake and written by someone else), and some well-known studies have found that low linguistic diversity at a young age predicts dementia in old age.
In general, longer notes have more distinct words, which makes sense. The following is a measure of lexical density, calculated here as the number of distinct words divided by the total number of words (a ratio also known as the type-token ratio). The higher the number, the higher the proportion of distinct words. The lower the number, the more words are repeated.
suicide_words %>%
group_by(author) %>%
summarise(num_words = n(),
lex_diversity = n_distinct(word),
lex_density = n_distinct(word)/n())
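The arithmetic behind these columns is easy to check by hand on a tiny example. The sketch below uses a hypothetical toy_words data frame (not the real notes) in which author A repeats every word and author B repeats only one.

```r
library(dplyr)
library(tibble)

# Hypothetical tokenized data: one word per row, as unnest_tokens() would produce
toy_words <- tibble(
  author = c(rep("A", 6), rep("B", 4)),
  word   = c("i", "love", "you", "i", "love", "you",
             "please", "forgive", "me", "please")
)

toy_summary <- toy_words %>%
  group_by(author) %>%
  summarise(num_words     = n(),                   # rows per author = total words
            lex_diversity = n_distinct(word),      # distinct words per author
            lex_density   = n_distinct(word) / n())

toy_summary
```

Author A uses 3 distinct words out of 6 (0.5), while author B uses 3 out of 4 (0.75), so B has the higher ratio even though both have the same number of distinct words.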
Another measure related to verbal complexity is the length of the words, i.e., the number of characters in each word. That can be calculated with nchar().
Here is a table with the length of words, with the longest words at the top of the table.
suicide_words %>%
mutate(word_length = nchar(word)) %>%
distinct(word, word_length, author) %>%
arrange(-word_length)
We can get the average word length for each author by using mean() on word_length:
suicide_words %>%
group_by(author) %>%
mutate(word_length = nchar(word)) %>%
summarize(mean_word_length = mean(word_length)) %>%
arrange(-mean_word_length)
Graph word length distributions for all notes.
suicide_words %>%
mutate(word_length = nchar(word)) %>%
ggplot(aes(word_length)) +
geom_histogram(binwidth = 1)
Recreate the above graph, but make two additions:
1. Add the following line: facet_wrap(vars(author), scales = "free_y"). This will create mini graphs for each author.
2. Add a line with labs(title = "") and then put a title in the quotes.
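One possible shape for the finished graph is sketched below on made-up data. The toy_words data frame and the title text are placeholders, so swap in suicide_words and a title of your own.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Hypothetical tokenized data standing in for suicide_words
toy_words <- tibble(
  author = c("A", "A", "A", "B", "B", "B", "B"),
  word   = c("i", "love", "you", "please", "forgive", "me", "always")
)

p <- toy_words %>%
  mutate(word_length = nchar(word)) %>%
  ggplot(aes(word_length)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(vars(author), scales = "free_y") +  # one panel per author
  labs(title = "Word lengths by author")         # placeholder title

p
```

Note that each new line is joined to the plot with +, not with the pipe. With scales = "free_y", each panel gets its own y-axis, so short notes are not flattened by long ones.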
To get a sense of the content of suicide notes, we could look at the most common words used.
suicide_words %>%
count(word, sort = T)
Copy-paste the above code below, and then separate the word counts by author by adding the following line between the two lines above: group_by(author) %>% .
suicide_words %>%
group_by(author) %>%
count(word, sort = T)
Graph the most common words for each author by copy-pasting the above code on top of the following. Make sure to connect it with the pipe.
suicide_words %>%
group_by(author) %>%
count(word, sort = T) %>%
top_n(5) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = author)) +
geom_col(show.legend = FALSE) +
coord_flip() +
facet_wrap(~author, scales = "free") + # creates separate graphs for each author
scale_fill_viridis_d() + # uses a nicer color scheme
theme_minimal() + # removes the gray background
labs(x = NULL, y = "Most common words")
Selecting by n
The problem is that the most common words are not very interesting: the, of, and, etc. These are called stop words, and tidytext includes a file with them so you can remove them. Load them and view them:
stop_words <- get_stopwords()
stop_words$word
[1] "i" "me" "my" "myself" "we" "our" "ours" "ourselves"
[9] "you" "your" "yours" "yourself" "yourselves" "he" "him" "his"
[17] "himself" "she" "her" "hers" "herself" "it" "its" "itself"
[25] "they" "them" "their" "theirs" "themselves" "what" "which" "who"
[33] "whom" "this" "that" "these" "those" "am" "is" "are"
[41] "was" "were" "be" "been" "being" "have" "has" "had"
[49] "having" "do" "does" "did" "doing" "would" "should" "could"
[57] "ought" "i'm" "you're" "he's" "she's" "it's" "we're" "they're"
[65] "i've" "you've" "we've" "they've" "i'd" "you'd" "he'd" "she'd"
[73] "we'd" "they'd" "i'll" "you'll" "he'll" "she'll" "we'll" "they'll"
[81] "isn't" "aren't" "wasn't" "weren't" "hasn't" "haven't" "hadn't" "doesn't"
[89] "don't" "didn't" "won't" "wouldn't" "shan't" "shouldn't" "can't" "cannot"
[97] "couldn't" "mustn't" "let's" "that's" "who's" "what's" "here's" "there's"
[105] "when's" "where's" "why's" "how's" "a" "an" "the" "and"
[113] "but" "if" "or" "because" "as" "until" "while" "of"
[121] "at" "by" "for" "with" "about" "against" "between" "into"
[129] "through" "during" "before" "after" "above" "below" "to" "from"
[137] "up" "down" "in" "out" "on" "off" "over" "under"
[145] "again" "further" "then" "once" "here" "there" "when" "where"
[153] "why" "how" "all" "any" "both" "each" "few" "more"
[161] "most" "other" "some" "such" "no" "nor" "not" "only"
[169] "own" "same" "so" "than" "too" "very" "will"
Use anti_join() to remove all stop words. anti_join() is the opposite of a join: it finds the words that the two data frames have in common (suicide_words and stop_words in this case), removes them from the first data frame, and keeps all other words.
suicide_words %>%
anti_join(stop_words)
Joining, by = "word"
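A small sketch makes the behaviour of anti_join() concrete. Both data frames here are hypothetical: toy_words stands in for suicide_words and toy_stop for the stop word list.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for suicide_words and stop_words
toy_words <- tibble(word = c("i", "love", "you", "and", "the", "children"))
toy_stop  <- tibble(word = c("i", "you", "and", "the"))

# Keep only the rows of toy_words whose word has no match in toy_stop
kept <- toy_words %>%
  anti_join(toy_stop, by = "word")

kept
```

Naming the column with by = "word" gives the same result as leaving it out, but it silences the "Joining, by" message.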
To count these words, copy-paste the above code below, and pipe it to the following line: count(word, sort = T)
Separated by author.
suicide_words %>%
anti_join(stop_words) %>%
group_by(author) %>%
count(word, sort = T)
Joining, by = "word"
Copy the code above and pipe it into the following, which creates a graph of the most common words in each note, but now with the stop words removed:
suicide_words %>%
anti_join(stop_words) %>%
group_by(author) %>%
count(word, sort = T) %>%
top_n(5) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = author)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "Most common words") +
facet_wrap(vars(author), scales = "free") +
scale_fill_viridis_d() +
theme_minimal() +
coord_flip()
Joining, by = "word"
Selecting by n
Term frequency-inverse document frequency (TF-IDF) is a measure of the importance of a word in one document relative to other documents. It finds the words that are most distinctive of one document or author. It’s a mouthful so let’s break it down:
Term frequency (TF) = the number of times a term appears in a document
Document frequency (DF) = the number of documents that contain the word
Inverse document frequency (IDF) = 1/DF
TF-IDF = TF * IDF
So it’s a measure of how often a word appears in one document, divided by the number of documents it appears in.
For example, say we have 10 web pages. If the word ‘the’ appears in one web page 12 times, its TF = 12. If ‘the’ appears in all 10 of the web pages, its DF = 10 and its IDF = 1/10. That makes its TF-IDF = 12 * 1/10 or 1.2.
But if the word ‘love’ appears 8 times on one web page, but appears in just 3 of the 10 web pages, its TF-IDF would be 8 * 1/3 = 8/3, or about 2.7.
Notice that the word ‘love’ has a higher TF-IDF than the common word ‘the’. ‘Love’ appears often on that page and doesn’t appear in all of the other pages, which makes it important for that particular page.
The math is actually a little more complicated than that, and there are variations on the basic formulas, but that’s the principle.
See the chapter on TF-IDF in Text Mining with R for more information: https://www.tidytextmining.com/tfidf.html.
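For the curious, the variation that tidytext's bind_tf_idf() uses is: TF is the word's count divided by the total number of words in that document, and IDF is the natural log of (number of documents / number of documents containing the word). The sketch below recomputes this by hand on a hypothetical toy_counts table and compares it with the package result. The data and the names by_hand and from_pkg are made up for illustration.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical word counts for three authors (documents)
toy_counts <- tribble(
  ~author, ~word,   ~n,
  "A",     "the",   4,
  "A",     "love",  2,
  "B",     "the",   3,
  "B",     "sorry", 1,
  "C",     "the",   5,
  "C",     "tired", 5
)

n_docs <- n_distinct(toy_counts$author)

by_hand <- toy_counts %>%
  group_by(author) %>%
  mutate(tf = n / sum(n)) %>%                     # share of the document's words
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(author)),  # ln(docs / docs with the word)
         tf_idf = tf * idf) %>%
  ungroup()

from_pkg <- toy_counts %>%
  bind_tf_idf(word, author, n)

all.equal(by_hand$tf_idf, from_pkg$tf_idf)
```

Because ‘the’ appears in all three documents, its IDF is ln(3/3) = 0, so its TF-IDF is zero no matter how often it is used.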
The following large code chunk does all of the calculations for TF-IDF in a couple of steps, and then shows a table of them.
suicide_word_counts <- suicide_notes %>% # This counts each word per author
unnest_tokens(word, text) %>%
count(author, word, sort = TRUE)
total_words <- suicide_word_counts %>% # This counts total words per author
group_by(author) %>%
summarize(total = sum(n))
suicide_word_counts <- left_join(suicide_word_counts, total_words) # Joins the two
Joining, by = "author"
suicide_tf_idf <- suicide_word_counts %>% # Calculates tf-idf
bind_tf_idf(word, author, n)
suicide_tf_idf %>% # Displays it
arrange(-tf_idf)
Graph it.
suicide_tf_idf %>%
arrange(-tf_idf) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(author) %>%
top_n(5) %>%
ggplot(aes(word, tf_idf, fill = author)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~author, scales = "free") +
coord_flip()
Selecting by tf_idf
Notice that, although we did not remove the stop words for this, tf-idf automatically excludes most of them: a word that appears in every note receives a tf-idf of zero.
To clean this graph up a little, add the following lines, connecting each one to the plot with +:
1. theme_minimal(), which will get rid of the grey background,
2. scale_fill_viridis_d(), which will use an improved color palette,
3. labs(title = "Most distinctive words in each suicide note")
suicide_tf_idf %>%
arrange(-tf_idf) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(author) %>%
top_n(5) %>%
ggplot(aes(word, tf_idf, fill = author)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~author, scales = "free") +
coord_flip() +
theme_minimal() + # removes the grey background
scale_fill_viridis_d() + # uses an improved color palette
labs(title = "Most distinctive words in each suicide note")
Selecting by tf_idf
Assignment: Read in the file called manifestos.xlsx. It contains the writings of several mass killers, including the Unabomber; Anders Breivik, who killed 70+ people in Norway; Pekka-Eric Auvinen, a school shooter from Finland; Elliot Rodger, who killed people in California; Seung-Hui Cho, who killed people at Virginia Tech; and Chris Harper-Mercer, who killed people at a college in Oregon. (I collected these writings and put them into an Excel file. Breivik wrote the most by far; I took only a small portion of his writings.)
terror_notes <- read_excel("manifestos.xlsx")
terror_notes
terror_words <- terror_notes %>%
unnest_tokens(word, text)
terror_words
step 1
terror_words %>%
group_by(author) %>%
summarise(num_words = n(),
lex_diversity = n_distinct(word),
lex_density = n_distinct(word)/n())
terror_words %>%
group_by(author) %>%
mutate(word_length = nchar(word)) %>%
summarize(mean_word_length = mean(word_length)) %>%
arrange(-mean_word_length)
terror_words %>%
group_by(author) %>%
count(word, sort = T) %>%
top_n(5) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = author)) +
geom_col(show.legend = FALSE) +
coord_flip() +
facet_wrap(~author, scales = "free") + # creates separate graphs for each author
scale_fill_viridis_d() + # uses a nicer color scheme
theme_minimal() + # removes the gray background
labs(x = NULL, y = "Most common words")
Selecting by n
terror_word_counts <- terror_notes %>% # This counts each word per author
unnest_tokens(word, text) %>%
count(author, word, sort = TRUE)
total_words <- terror_word_counts %>% # This counts total words per author
group_by(author) %>%
summarize(total = sum(n))
terror_word_counts <- left_join(terror_word_counts, total_words) # Joins the two
Joining, by = "author"
terror_tf_idf <- terror_word_counts %>% # Calculates tf-idf
bind_tf_idf(word, author, n)
terror_tf_idf %>% # Displays it
arrange(-tf_idf)
terror_tf_idf %>%
arrange(-tf_idf) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(author) %>%
top_n(5) %>%
ggplot(aes(word, tf_idf, fill = author)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~author, scales = "free") +
coord_flip() +
theme_minimal() + # removes the grey background
scale_fill_viridis_d() + # uses an improved color palette
labs(title = "Most distinctive words in each terror note")
Selecting by tf_idf