This file demonstrates a basic workflow for taking some pre-loaded texts and quickly performing elementary text analysis tasks. The quanteda package comes with a built-in set of inaugural addresses from US Presidents. We begin by loading quanteda and examining these texts. The summary command outputs the name of each text along with the number of types (unique tokens), tokens, and sentences it contains. Below we use R’s indexing syntax to apply the summary command selectively to the first five texts.
require(quanteda)
## Loading required package: quanteda
##
## Attaching package: 'quanteda'
##
## The following object is masked from 'package:base':
##
## sample
summary(inaugTexts[1:5])
## Text Types Tokens Sentences
## 1 1789-Washington 595 1430 24
## 2 1793-Washington 90 135 4
## 3 1797-Adams 794 2318 37
## 4 1801-Jefferson 681 1726 42
## 5 1805-Jefferson 776 2166 45
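Because inaugTexts is simply a named character vector, base R functions also apply to it directly. A quick sketch using only base functions:
names(inaugTexts)[1:3]  # the document names
nchar(inaugTexts[1:3])  # raw character counts for the first three texts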
One of the most fundamental text analysis tasks is tokenization. To tokenize a text is to split it into units, most commonly words, which can be counted and form the basis of a quantitative analysis. The quanteda package has a function for tokenization: tokenize. Examine the manual page for this function.
?tokenize
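Beyond the manual page, you can print the function’s argument list and default values at the console with base R’s args():
args(tokenize)  # show the formal arguments and their defaults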
quanteda’s tokenize function can be used on a single character string, a vector of several texts, or a corpus. Here are some examples:
tokenize('Today is Thursday in Canberra. It is yesterday in London.')
## tokenizedText object from 1 document.
## Component 1 :
## [1] "Today" "is" "Thursday" "in" "Canberra"
## [6] "." "It" "is" "yesterday" "in"
## [11] "London" "."
vec <- c(one='This is text one', two='This, however, is the second text')
tokenize(vec)
## tokenizedText object from 2 documents.
## one :
## [1] "This" "is" "text" "one"
##
## two :
## [1] "This" "," "however" "," "is" "the" "second"
## [8] "text"
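Words are not the only possible unit. The tokenize manual page documents a what argument for choosing the unit of tokenization; a sketch, assuming the sentence option described there:
# split into sentences rather than words
tokenize(vec, what = 'sentence')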
Consider the default arguments to the tokenize function. To remove punctuation, set the removePunct argument to TRUE. We can combine this with the toLower function to get a cleaned, lowercased, and tokenized version of our text.
tokenize(toLower(vec), removePunct = TRUE)
## tokenizedText object from 2 documents.
## one :
## [1] "this" "is" "text" "one"
##
## two :
## [1] "this" "however" "is" "the" "second" "text"
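Similar flags exist for other cleaning steps; for example, the manual page also documents a removeNumbers argument. A hypothetical example (vec2 is made up for illustration):
vec2 <- c(three = 'Text 3 has 2 numbers in it')
tokenize(toLower(vec2), removePunct = TRUE, removeNumbers = TRUE)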
Using this function with the inaugural addresses:
inaugTokens <- tokenize(toLower(inaugTexts))
inaugTokens[2]
## $`1793-Washington`
## [1] "fellow" "citizens" "," "i"
## [5] "am" "again" "called" "upon"
## [9] "by" "the" "voice" "of"
## [13] "my" "country" "to" "execute"
## [17] "the" "functions" "of" "its"
## [21] "chief" "magistrate" "." "when"
## [25] "the" "occasion" "proper" "for"
## [29] "it" "shall" "arrive" ","
## [33] "i" "shall" "endeavor" "to"
## [37] "express" "the" "high" "sense"
## [41] "i" "entertain" "of" "this"
## [45] "distinguished" "honor" "," "and"
## [49] "of" "the" "confidence" "which"
## [53] "has" "been" "reposed" "in"
## [57] "me" "by" "the" "people"
## [61] "of" "united" "america" "."
## [65] "previous" "to" "the" "execution"
## [69] "of" "any" "official" "act"
## [73] "of" "the" "president" "the"
## [77] "constitution" "requires" "an" "oath"
## [81] "of" "office" "." "this"
## [85] "oath" "i" "am" "now"
## [89] "about" "to" "take" ","
## [93] "and" "in" "your" "presence"
## [97] ":" "that" "if" "it"
## [101] "shall" "be" "found" "during"
## [105] "my" "administration" "of" "the"
## [109] "government" "i" "have" "in"
## [113] "any" "instance" "violated" "willingly"
## [117] "or" "knowingly" "the" "injunctions"
## [121] "thereof" "," "i" "may"
## [125] "(" "besides" "incurring" "constitutional"
## [129] "punishment" ")" "be" "subject"
## [133] "to" "the" "upbraidings" "of"
## [137] "all" "who" "are" "now"
## [141] "witnesses" "of" "the" "present"
## [145] "solemn" "ceremony" "."
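Under the hood, a tokenizedText object behaves like a named list of character vectors, so base R can count the tokens in each document:
# token counts per document, computed from the underlying list
sapply(inaugTokens[1:3], length)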
Once each text has been split into words, we can use the dfm function to create a matrix of counts of the occurrences of each word in each document:
inaugDfm <- dfm(inaugTokens)
##
## ... indexing documents: 57 documents
## ... indexing features: 9,174 feature types
## ... created a 57 x 9174 sparse dfm
## ... complete.
## Elapsed time: 0.346 seconds.
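With the dfm in hand, a quick way to inspect it is quanteda’s topfeatures function, which lists the most frequent features:
topfeatures(inaugDfm, 10)  # the ten most frequent features overall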
Note that dfm() works on a variety of object types, including character vectors, corpus objects, and tokenized text objects. This gives the user control over each intermediate step, while also making it easy to go directly from raw texts to a document-by-feature matrix.
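For example, the explicit tokenization step above could be skipped entirely; as the processing messages later in this demo show, dfm() lowercases and tokenizes internally when handed raw texts:
# going straight from raw texts to a document-feature matrix
inaugDfm2 <- dfm(inaugTexts)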
To see which classes of object a particular method (function) is defined for, you can use the methods() function:
methods(dfm)
## [1] dfm.character* dfm.corpus* dfm.tokenizedTexts*
## see '?methods' for accessing help and source code
Likewise, you can also figure out what methods are defined for any given class of object, using the same function:
methods(class = "tokenizedTexts")
## [1] dfm kwic ngrams print
## [5] removeFeatures skipgrams syllables toLower
## [9] wordstem
## see '?methods' for accessing help and source code
If we are interested in analysing the texts with respect to some other variables, we can create a corpus object to associate the texts with this metadata. For example, consider the last six inaugural addresses:
summary(inaugTexts[52:57])
## Text Types Tokens Sentences
## 1 1993-Clinton 600 1598 81
## 2 1997-Clinton 719 2157 112
## 3 2001-Bush 585 1584 97
## 4 2005-Bush 725 2071 101
## 5 2009-Obama 893 2390 112
## 6 2013-Obama 781 2097 90
We can use the docvars argument to the corpus function to record the party with which each text is associated:
dv <- data.frame(Party = c('dem','dem','rep','rep','dem','dem'))
recentCorpus <- corpus(inaugTexts[52:57], docvars=dv)
summary(recentCorpus)
## Corpus consisting of 6 documents.
##
## Text Types Tokens Sentences Party
## 1993-Clinton 600 1598 81 dem
## 1997-Clinton 719 2157 112 dem
## 2001-Bush 585 1584 97 rep
## 2005-Bush 725 2071 101 rep
## 2009-Obama 893 2390 112 dem
## 2013-Obama 781 2097 90 dem
##
## Source: /Users/ksosulsk/Desktop/ITAUR-master/1_demo/* on x86_64 by ksosulsk
## Created: Mon Oct 19 11:11:43 2015
## Notes:
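The attached metadata can be retrieved at any point with the docvars accessor:
docvars(recentCorpus)  # the data frame of document variables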
We can use this metadata to aggregate feature counts across the documents in each group when creating a document-feature matrix:
partyDfm <- dfm(recentCorpus, groups = 'Party', ignoredFeatures = stopwords('english'))
## Creating a dfm from a corpus ...
## ... grouping texts by variable: Party
## ... lowercasing
## ... tokenizing
## ... indexing documents: 2 documents
## ... indexing features: 2,303 feature types
## ... removed 115 features, from 174 supplied (fixed) feature types
## ... created a 2 x 2188 sparse dfm
## ... complete.
## Elapsed time: 0.03 seconds.
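Since grouping collapsed the texts by party, the resulting dfm has just two rows, one per party, which can be compared directly (a sketch, assuming dfm rows can be indexed by document name as in this version of quanteda):
topfeatures(partyDfm['dem', ], 10)  # most frequent features in the Democratic speeches
topfeatures(partyDfm['rep', ], 10)  # most frequent features in the Republican speeches
Finally, we can visualize the two groups as a comparison word cloud: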
wordcloud::comparison.cloud(t(as.matrix(partyDfm)))
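The comparison.cloud function comes from the wordcloud package, which quanteda does not load for you; it expects a term-by-document matrix, which is why the dfm is transposed above. A minimal guard to install the package if it is missing:
# install wordcloud if it is not already available
if (!requireNamespace('wordcloud', quietly = TRUE)) {
  install.packages('wordcloud')
}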