Source: University of Virginia Library, Leah Malkovich. Nov 27, 2018.
Quanteda has three major components of text:
- The corpus is the entire text body object, such as a book or a chapter of a book.
- Tokens are the individual words of the corpus, separated out from the running text.
- The document-feature matrix (DFM) organizes the tokenized words into columns, which makes analysis easier; it can be viewed with R's View() function once converted to a data frame (shown later).
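As a minimal sketch of how these three objects relate (using toy text rather than the books below):

# toy example: corpus -> tokens -> document-feature matrix
library(quanteda)
toy_corpus = corpus(c(doc1 = "The cat sat on the mat.",
                      doc2 = "The cat ran off."))
toy_tokens = tokens(toy_corpus)
toy_dfm = dfm(toy_tokens)
toy_dfm # rows are documents, columns are features (words)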
The corpus used in this guide comes from Project Gutenberg: Pride and Prejudice by Jane Austen and A Tale of Two Cities by Charles Dickens.
# if you need to install
# install.packages('quanteda')
library(quanteda)
library(tidyverse)
# readtext allows for .txt, .csv, .json, .doc and .pdf files
library(readtext)
# Project Gutenberg books
library(gutenbergr)
# create a temporary directory to store texts
dir.create("tmp")
# download texts
download.file(url = "https://www.gutenberg.org/files/1342/1342-0.txt",
              destfile = "tmp/Pride and Prejudice_Jane Austen_2008_English.txt")
trying URL 'https://www.gutenberg.org/files/1342/1342-0.txt'
Content type 'text/plain' length 798774 bytes (780 KB)
==================================================
downloaded 780 KB
download.file(url = "https://www.gutenberg.org/files/98/98-0.txt",
              destfile = "tmp/A Tale of Two Cities_Charles Dickens_2009_English.txt")
trying URL 'https://www.gutenberg.org/files/98/98-0.txt'
Content type 'text/plain' length 807231 bytes (788 KB)
==================================================
downloaded 788 KB
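Since gutenbergr is loaded above, the same books could alternatively be fetched with its gutenberg_download() function, which returns a data frame of text lines rather than files on disk (a sketch; the IDs 1342 and 98 come from the URLs above):

# alternative download via gutenbergr (not used in the rest of this guide)
# books <- gutenberg_download(c(1342, 98), meta_fields = c("title", "author"))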
# read in texts
dataframe <- readtext("tmp/*.txt",
                      docvarsfrom = "filenames",
                      docvarnames = c("title", "author",
                                      "year uploaded", "language"),
                      dvsep = "_",
                      encoding = "UTF-8")
# delete tmp directory
unlink("tmp", recursive = TRUE)
doc_corpus = corpus(dataframe)
summary(doc_corpus)
Corpus consisting of 2 documents, showing 2 documents:
                                                  Text Types Tokens Sentences
A Tale of Two Cities_Charles Dickens_2009_English.txt 11584 170042      7931
      Pride and Prejudice_Jane Austen_2008_English.txt  7469 147567      6213
                title          author year.uploaded language
 A Tale of Two Cities Charles Dickens          2009  English
  Pride and Prejudice     Jane Austen          2008  English
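The document-level variables parsed from the filenames can be inspected (or edited) at any point with the docvars() function:

docvars(doc_corpus)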
Next we need to clean and tokenize the corpus. Tokenizing splits the sentences up into individual words:
doc_tokens = tokens(doc_corpus)
doc_tokens
Tokens consisting of 2 documents and 4 docvars.
A Tale of Two Cities_Charles Dickens_2009_English.txt :
[1] "The" "Project" "Gutenberg" "eBook" "of" "A" "Tale"
[8] "of" "Two" "Cities" "," "by"
[ ... and 170,030 more ]
Pride and Prejudice_Jane Austen_2008_English.txt :
[1] "The" "Project" "Gutenberg" "eBook" "of" "Pride" "and"
[8] "Prejudice" "," "by" "Jane" "Austen"
[ ... and 147,555 more ]
Now that we have the document corpus tokenized, we can rerun the code to further clean the tokens by removing punctuation, numbers, and symbols:
doc_tokens = tokens(doc_tokens,
                    remove_punct = TRUE,
                    remove_numbers = TRUE,
                    remove_symbols = TRUE)
doc_tokens
Tokens consisting of 2 documents and 4 docvars.
A Tale of Two Cities_Charles Dickens_2009_English.txt :
[1] "The" "Project" "Gutenberg" "eBook" "of" "A" "Tale"
[8] "of" "Two" "Cities" "by" "Charles"
[ ... and 139,561 more ]
Pride and Prejudice_Jane Austen_2008_English.txt :
[1] "The" "Project" "Gutenberg" "eBook" "of" "Pride" "and"
[8] "Prejudice" "by" "Jane" "Austen" "This"
[ ... and 124,919 more ]
Next we need to remove stop words from our tokens. Stop words are very common words such as "the", "and", and "it"; removing them helps produce a better analysis.
doc_tokens = tokens_select(doc_tokens,
                           stopwords('english'), # mind the spelling of stopwords
                           selection = 'remove')
doc_tokens
Tokens consisting of 2 documents and 4 docvars.
A Tale of Two Cities_Charles Dickens_2009_English.txt :
[1] "Project" "Gutenberg" "eBook" "Tale" "Two" "Cities" "Charles"
[8] "Dickens" "eBook" "use" "anyone" "anywhere"
[ ... and 65,212 more ]
Pride and Prejudice_Jane Austen_2008_English.txt :
[1] "Project" "Gutenberg" "eBook" "Pride" "Prejudice" "Jane" "Austen"
[8] "eBook" "use" "anyone" "anywhere" "United"
[ ... and 56,325 more ]
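tokens_select() accepts any word list, not just stopwords(), so the same call could also strip corpus-specific boilerplate terms (a sketch; this particular word list is an illustration, not part of the original guide):

# remove additional, corpus-specific terms (hypothetical list)
doc_tokens_custom = tokens_select(doc_tokens,
                                  c("project", "gutenberg", "ebook"),
                                  selection = 'remove')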
Stemming reduces each word to its stem. For example, the stem of the word "dance" is "danc", and the endings in a text could be -ing or -ed.
doc_tokens = tokens_wordstem(doc_tokens)
doc_tokens
Tokens consisting of 2 documents and 4 docvars.
A Tale of Two Cities_Charles Dickens_2009_English.txt :
[1] "Project" "Gutenberg" "eBook" "Tale" "Two" "Citi" "Charl"
[8] "Dicken" "eBook" "use" "anyon" "anywher"
[ ... and 65,212 more ]
Pride and Prejudice_Jane Austen_2008_English.txt :
[1] "Project" "Gutenberg" "eBook" "Pride" "Prejudic" "Jane" "Austen"
[8] "eBook" "use" "anyon" "anywher" "Unite"
[ ... and 56,325 more ]
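To see the stemmer in isolation, char_wordstem() applies the same stemming directly to a character vector (a quick check using the example words from above):

char_wordstem(c("dance", "dancing", "danced"))
# [1] "danc" "danc" "danc"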
Now we can make all the word stems lowercase for standardization:
doc_tokens = tokens_tolower(doc_tokens)
doc_tokens
Tokens consisting of 2 documents and 4 docvars.
A Tale of Two Cities_Charles Dickens_2009_English.txt :
[1] "project" "gutenberg" "ebook" "tale" "two" "citi" "charl"
[8] "dicken" "ebook" "use" "anyon" "anywher"
[ ... and 65,212 more ]
Pride and Prejudice_Jane Austen_2008_English.txt :
[1] "project" "gutenberg" "ebook" "pride" "prejudic" "jane" "austen"
[8] "ebook" "use" "anyon" "anywher" "unite"
[ ... and 56,325 more ]
A summary of the word tokens:
summary(doc_tokens)
                                                       Length Class  Mode
A Tale of Two Cities_Charles Dickens_2009_English.txt  65224  -none- character
Pride and Prejudice_Jane Austen_2008_English.txt       56337  -none- character
After tokenizing and cleaning the tokens into stems, we convert them to a document-feature matrix (DFM):
doc_dfm = dfm(doc_tokens)
doc_dfm
Document-feature matrix of: 2 documents, 7,972 features (31.27% sparse) and 4 docvars.
                                                        features
docs                                                     project gutenberg ebook tale two citi charl dicken use anyon
  A Tale of Two Cities_Charles Dickens_2009_English.txt      91        31    20    6 214   41   102      3  78     5
  Pride and Prejudice_Jane Austen_2008_English.txt           90        31    21    0 131    2     7      0  63    26
[ reached max_nfeat ... 7,962 more features ]
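To view the matrix interactively (as mentioned in the introduction), the dfm can be converted to a regular data frame first (a sketch):

# convert the sparse dfm to a plain data frame for viewing
dfm_df = convert(doc_dfm, to = "data.frame")
# View(dfm_df)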
We can use the kwic() function, the keywords-in-context function, which shows specific words in the context in which they appear. If we want to know where the word "love" occurs in the document corpus, and to see a small window of context around each occurrence, we set window = n, where n is the number of words shown on each side of the keyword. kwic() returns the location of each specific instance of the word.
# the doc_dfm would return an error; this function needs the tokenized corpus
head(kwic(doc_tokens, pattern = "love", window = 3))
Keyword-in-context with 6 matches.
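Note that because the tokens have been stemmed and lowercased, the pattern "love" matches the stemmed token. kwic() also accepts several patterns at once, as well as glob wildcards (the second pattern below is just an illustration):

head(kwic(doc_tokens, pattern = c("love", "marri*"), window = 3))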
What are the most used words in the document corpus? We can use the topfeatures() function, which takes an argument n for the number of features to return, so for a top 10, n = 10.
# need to use the dfm
topfeatures(doc_dfm,
            n = 10,
            decreasing = TRUE)
       mr      said       one      look elizabeth      time      know      miss      much       now 
     1406      1063       721       646       634       541       540       525       513       465 
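topfeatures() can also break the counts out per book; a sketch assuming quanteda v3, where the groups argument takes a docvar:

# top 5 stems per document, grouped by the title docvar
topfeatures(doc_dfm, n = 5, groups = title)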
If you want document frequencies instead, that is, the number of documents in which each feature appears, use scheme = 'docfreq':
# need to use the dfm
topfeatures(doc_dfm,
            n = 10,
            decreasing = TRUE,
            scheme = 'docfreq')
  project gutenberg     ebook       two      citi     charl       use     anyon   anywher     unite 
        2         2         2         2         2         2         2         2         2         2 
These have been the basics of text analysis using quanteda, and there is plenty more to learn from here.