2022-05-13

Your instructors

Ryan Clement, Data Services Librarian: go/ryan/

Wendy Shook, Science Data Librarian: go/wshook/

Plan for today

  1. Why is text different?
  2. Where can you get text for analysis?
  3. What can you do with text and computers?
  4. What are some non-coding tools for text analysis?
  5. Voyant
  6. R and TidyText

Where this workshop comes from (in part)

text data v. tabular data

  • why is text data called “unstructured”?
  • what issues are common with text data?
  • where can you get text data for your work?

some sources for text data, part i.

Make your own!

  • Surveys
  • Transcription of audio/video
  • Digitizing physical texts

Social Media & Web Data

  • Twitter
  • Facebook
  • Reddit
  • Web scraping
  • remember ethical and technical concerns…

some sources for text data, part ii.

what can you do with text and computers?

  1. visualize single texts
  2. measure features of texts (diction, sentiment, structure)
  3. compare features of multiple texts (diction, sentiment, structure)
  4. find, organize texts (visualization, mapping, network analysis)
  5. model forms or genres
  6. model structures ‘outside’ of literature (social, historical, etc.)
  7. unsupervised modeling (topic modeling)

“Seven ways humanists are using computers to understand text” (Underwood, 2015)

tools for working with text

a moment with Voyant

and now, some R…

get to the sample R script

what is tidy text?

  • tidy data principles?
    1. Each variable has its own column
    2. Each observation has its own row
    3. Each value must have its own cell
  • for text data, this means a table with one token per row
  • not all text mining work can use the tidy format; some other common formats are:
    1. Strings – i.e., character vectors (often the way text is imported)
    2. Corpus – strings annotated with additional metadata
    3. Document-term matrix – a matrix describing a collection of documents, with one row per document and one column per term

converting to tidy text: the unnest_tokens() function

text_df %>%
  unnest_tokens(word, text)

  • text is split into tokens (default is words)
  • other columns are retained
  • punctuation is stripped
  • by default, unnest_tokens() converts the tokens to lowercase, which makes them easier to compare or combine with other datasets (use the to_lower = FALSE argument to turn off this behavior)
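
A minimal, self-contained illustration of the call (the two-line text_df below is invented for the example, not workshop data):

library(tidytext)
library(dplyr)

# a tiny example data frame: one row per line of text
text_df <- tibble(line = 1:2,
                  text = c("Tidy data principles apply to text, too.",
                           "Each token gets its own row!"))

text_df %>%
  unnest_tokens(word, text)   # one lowercase word per row; the line column is retained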

a typical workflow
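
A minimal sketch of that workflow, using the Jane Austen corpus we work with below (the count threshold and object names are illustrative, not the workshop’s exact script):

library(janeaustenr)
library(tidytext)
library(dplyr)
library(ggplot2)

# import, tokenize, clean, count, visualize
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number()) %>%   # remember each word's original line; used later
  ungroup() %>%
  unnest_tokens(word, text) %>%           # one word per row
  anti_join(stop_words, by = "word")      # drop very common function words

tidy_books %>%
  count(word, sort = TRUE) %>%            # word frequencies, highest first
  filter(n > 600) %>%
  ggplot(aes(n, reorder(word, n))) +
  geom_col() +
  labs(x = "count", y = NULL)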

exercise #1

Use what we’ve just learned to create a bar chart of the 15 least used words that are used over 100 times in the Jane Austen corpus.

Hint: the tail() function might be quite useful here.
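
One possible solution sketch, assuming the tidy_books data frame from the workflow sketch above:

tidy_books %>%
  count(word, sort = TRUE) %>%     # most-used words first
  filter(n > 100) %>%              # keep only words used over 100 times...
  tail(15) %>%                     # ...then take the 15 least used of those
  ggplot(aes(n, reorder(word, n))) +
  geom_col() +
  labs(x = "count", y = NULL)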

exercise #2

Modify the code we used before for the most used words in the Jane Austen corpus so that we can see a separate graph of the most used words in each of the books in the corpus.

Hint: the facet_grid() function is important here.

the gutenbergr package

install.packages('gutenbergr')
library(gutenbergr)
gutenberg_metadata
View(gutenberg_metadata)
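
A hedged example of putting the package to work (gutenberg_works() filters the metadata for full-text English works; 1342 is the Project Gutenberg ID for Pride and Prejudice):

# all full-text English works by a given author
gutenberg_works(author == "Austen, Jane")

# download one work by its Project Gutenberg ID
pride <- gutenberg_download(1342)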

sentiment analysis with tidy data

get_sentiments() datasets
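
The call pattern is the same for each lexicon; a quick sketch (the AFINN and NRC lexicons are fetched via the textdata package and ask you to accept a license the first time):

library(tidytext)

get_sentiments("bing")    # words labelled "positive" or "negative"
get_sentiments("afinn")   # words scored from -5 (most negative) to +5 (most positive)
get_sentiments("nrc")     # words tagged with emotions such as joy, anger, or trust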

exercise #3

Use the get_sentiments() function to download each of the 3 datasets (AFINN, bing, and nrc). Make sure to read each license before you accept it.

how are sentiment lexica created?

“How were these sentiment lexicons put together and validated? They were constructed via either crowdsourcing (using, for example, Amazon Mechanical Turk) or by the labor of one of the authors, and were validated using some combination of crowdsourcing again, restaurant or movie reviews, or Twitter data. Given this information, we may hesitate to apply these sentiment lexicons to styles of text dramatically different from what they were validated on, such as narrative fiction from 200 years ago. While it is true that using these sentiment lexicons with, for example, Jane Austen’s novels may give us less accurate results than with tweets sent by a contemporary writer, we still can measure the sentiment content for words that are shared across the lexicon and the text.”

— Silge & Robinson, Text Mining with R: A Tidy Approach, 2017

a sidebar on dplyr joins

  • inner_join(): return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
  • anti_join(): return all rows from x where there are not matching values in y, keeping just columns from x.

Other info on these and other types of joins can be found on the documentation page.
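
A short sketch of how both joins show up in this workflow, assuming a one-word-per-row data frame like tidy_books from earlier:

# keep only the words that appear in the Bing lexicon, gaining its sentiment column
tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word")

# drop every word that appears in the stop_words list
tidy_books %>%
  anti_join(stop_words, by = "word")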

creating an index

The %/% operator does integer division (x %/% y is equivalent to floor(x/y)), so the index keeps track of which 80-line section of text we are counting up negative and positive sentiments in.
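
A hedged sketch of the sentiment-by-section count this refers to, assuming tidy_books (with its linenumber column) from earlier:

library(tidyr)

tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(book, index = linenumber %/% 80, sentiment) %>%                # one row per book / 80-line section / sentiment
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)                              # net sentiment per section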

comparing the sentiment dictionaries

exercise #4

Write some code that will add the word “miss” to our stop_words tibble. Call the lexicon “custom.”

HINT: the bind_rows() function will be helpful here.

When you’re done, please give a :thumbs-up: or a :green-check: in the Zoom reactions.
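
One possible solution sketch:

library(dplyr)
library(tidytext)

custom_stop_words <- bind_rows(
  tibble(word = "miss", lexicon = "custom"),  # our one custom addition
  stop_words
)
custom_stop_words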

analyzing frequencies: tf-idf

tf-idf multiplies a term’s frequency within a document (tf) by its inverse document frequency (idf), where

\[idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\]

NOTE: the statistic tf-idf is a rule-of-thumb quantity; while it is very useful (and widely used) in text mining, its theoretical foundations are regularly questioned by information experts.

Zipf’s law
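
Zipf’s law says a word’s frequency is roughly inversely proportional to its rank in the frequency table. A minimal sketch of checking this on the Austen corpus (object names here are illustrative):

library(janeaustenr)
library(tidytext)
library(dplyr)
library(ggplot2)

# how often each book uses each word (stop words deliberately kept)
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)

# within each book, rank words by frequency and compute term frequency
freq_by_rank <- book_words %>%
  group_by(book) %>%
  mutate(rank = row_number(),
         term_frequency = n / sum(n)) %>%
  ungroup()

# on log-log axes, Zipf's law shows up as a roughly straight line
ggplot(freq_by_rank, aes(rank, term_frequency, color = book)) +
  geom_line() +
  scale_x_log10() +
  scale_y_log10()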

the bind_tf_idf() function

The bind_tf_idf() function in the tidytext package takes a tidy text dataset as input with one row per token (term), per document. One column (word for us) contains the terms/tokens, one column contains the documents (book in our case), and the last necessary column contains the counts, how many times each document contains each term (n in our example).
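
A hedged sketch of that call, reusing the book_words counts from the Zipf’s law sketch above:

book_words %>%
  bind_tf_idf(word, book, n) %>%   # adds tf, idf, and tf_idf columns
  arrange(desc(tf_idf))            # each book's most distinctive words rise to the top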

moving beyond single words: n-grams

We can add two arguments (token = "ngrams" and n) to our unnest_tokens() function in order to capture n-grams instead of single words (unigrams); setting n = 2 gives bigrams, as below.

austen_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
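
From there, counting works just as it did for single words, for example:

austen_bigrams %>%
  count(bigram, sort = TRUE)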

Thank you!