This file demonstrates a basic workflow for quickly performing elementary text analysis tasks on some pre-loaded texts. The quanteda package comes with a built-in set of inaugural addresses from US Presidents. We begin by loading quanteda and examining these texts. The summary command outputs the name of each text along with the number of types, tokens, and sentences it contains. Below we use R’s indexing syntax to apply the summary command selectively to the first five texts.

require(quanteda)
## Loading required package: quanteda
## 
## Attaching package: 'quanteda'
## 
## The following object is masked from 'package:base':
## 
##     sample
summary(inaugTexts[1:5])
##              Text Types Tokens Sentences
## 1 1789-Washington   595   1430        24
## 2 1793-Washington    90    135         4
## 3      1797-Adams   794   2318        37
## 4  1801-Jefferson   681   1726        42
## 5  1805-Jefferson   776   2166        45
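
If we only need these counts, they can also be computed directly; a minimal sketch, assuming the ntoken() and ntype() helper functions available in this version of quanteda:

# count tokens and types (unique words) without printing a full summary
ntoken(inaugTexts[1:5])
ntype(inaugTexts[1:5])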

One of the most fundamental text analysis tasks is tokenization. To tokenize a text is to split it into units, most commonly words, which can be counted and which form the basis of a quantitative analysis. The quanteda package has a function for tokenization: tokenize. Examine the manual page for this function.

?tokenize

quanteda’s tokenize function can be used on a single character string, a character vector containing several texts, or a corpus. Here are some examples:

tokenize('Today is Thursday in Canberra. It is yesterday in London.')
## tokenizedText object from 1 document.
## Component 1 :
##  [1] "Today"     "is"        "Thursday"  "in"        "Canberra" 
##  [6] "."         "It"        "is"        "yesterday" "in"       
## [11] "London"    "."
vec <- c(one='This is text one', two='This, however, is the second text')
tokenize(vec)
## tokenizedText object from 2 documents.
## one :
## [1] "This" "is"   "text" "one" 
## 
## two :
## [1] "This"    ","       "however" ","       "is"      "the"     "second" 
## [8] "text"
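
Tokenization need not stop at words; tokenize() can also split on other units. For example, assuming the what argument documented in ?tokenize, we can segment a text into sentences:

# split into sentences instead of words
tokenize('Today is Thursday in Canberra. It is yesterday in London.', what = 'sentence')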

Consider the default arguments to the tokenize function. To remove punctuation, set the removePunct argument to TRUE. We can combine this with the toLower function to get a lowercased, cleaned, and tokenized version of our text.

tokenize(toLower(vec), removePunct = TRUE)
## tokenizedText object from 2 documents.
## one :
## [1] "this" "is"   "text" "one" 
## 
## two :
## [1] "this"    "however" "is"      "the"     "second"  "text"

Using this function with the inaugural addresses:

inaugTokens <- tokenize(toLower(inaugTexts))
inaugTokens[2]
## $`1793-Washington`
##   [1] "fellow"         "citizens"       ","              "i"             
##   [5] "am"             "again"          "called"         "upon"          
##   [9] "by"             "the"            "voice"          "of"            
##  [13] "my"             "country"        "to"             "execute"       
##  [17] "the"            "functions"      "of"             "its"           
##  [21] "chief"          "magistrate"     "."              "when"          
##  [25] "the"            "occasion"       "proper"         "for"           
##  [29] "it"             "shall"          "arrive"         ","             
##  [33] "i"              "shall"          "endeavor"       "to"            
##  [37] "express"        "the"            "high"           "sense"         
##  [41] "i"              "entertain"      "of"             "this"          
##  [45] "distinguished"  "honor"          ","              "and"           
##  [49] "of"             "the"            "confidence"     "which"         
##  [53] "has"            "been"           "reposed"        "in"            
##  [57] "me"             "by"             "the"            "people"        
##  [61] "of"             "united"         "america"        "."             
##  [65] "previous"       "to"             "the"            "execution"     
##  [69] "of"             "any"            "official"       "act"           
##  [73] "of"             "the"            "president"      "the"           
##  [77] "constitution"   "requires"       "an"             "oath"          
##  [81] "of"             "office"         "."              "this"          
##  [85] "oath"           "i"              "am"             "now"           
##  [89] "about"          "to"             "take"           ","             
##  [93] "and"            "in"             "your"           "presence"      
##  [97] ":"              "that"           "if"             "it"            
## [101] "shall"          "be"             "found"          "during"        
## [105] "my"             "administration" "of"             "the"           
## [109] "government"     "i"              "have"           "in"            
## [113] "any"            "instance"       "violated"       "willingly"     
## [117] "or"             "knowingly"      "the"            "injunctions"   
## [121] "thereof"        ","              "i"              "may"           
## [125] "("              "besides"        "incurring"      "constitutional"
## [129] "punishment"     ")"              "be"             "subject"       
## [133] "to"             "the"            "upbraidings"    "of"            
## [137] "all"            "who"            "are"            "now"           
## [141] "witnesses"      "of"             "the"            "present"       
## [145] "solemn"         "ceremony"       "."

Once each text has been split into words, we can use the dfm function to create a matrix of counts of the occurrences of each word in each document:

inaugDfm <- dfm(inaugTokens)
## 
##    ... indexing documents: 57 documents
##    ... indexing features: 9,174 feature types
##    ... created a 57 x 9174 sparse dfm
##    ... complete. 
## Elapsed time: 0.346 seconds.
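
We can take a quick look at the result; a sketch, assuming the topfeatures() function in this version of quanteda:

# the ten most frequent features across all 57 addresses
topfeatures(inaugDfm, 10)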

Note that dfm() works on a variety of object types, including character vectors, corpus objects, and tokenized text objects. This gives the user maximum flexibility and power, while also making it easy to achieve the same result by going directly from raw texts to a document-by-feature matrix.
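
For example, the explicit tokenization step above can be skipped entirely; since dfm.character is among the defined methods (see the methods() output below), we can build the matrix from the raw texts in one call:

# go directly from raw texts to a document-feature matrix
vecDfm <- dfm(vec)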

To see the classes of objects for which a particular method (function) is defined, you can use the methods() function:

methods(dfm)
## [1] dfm.character*      dfm.corpus*         dfm.tokenizedTexts*
## see '?methods' for accessing help and source code

Likewise, you can find out which methods are defined for a given class of object, using the same function:

methods(class = "tokenizedTexts")
## [1] dfm            kwic           ngrams         print         
## [5] removeFeatures skipgrams      syllables      toLower       
## [9] wordstem      
## see '?methods' for accessing help and source code
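
For example, wordstem appears in this list, so we can stem tokenized texts directly; a brief sketch, assuming its default stemmer:

# reduce each token to its word stem
wordstem(tokenize(toLower(vec), removePunct = TRUE))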

If we are interested in analysing the texts with respect to some other variables, we can create a corpus object to associate the texts with this metadata. For example, consider the last six inaugural addresses:

summary(inaugTexts[52:57])
##           Text Types Tokens Sentences
## 1 1993-Clinton   600   1598        81
## 2 1997-Clinton   719   2157       112
## 3    2001-Bush   585   1584        97
## 4    2005-Bush   725   2071       101
## 5   2009-Obama   893   2390       112
## 6   2013-Obama   781   2097        90

We can use the docvars argument to the corpus function to record the party with which each text is associated:

dv <- data.frame(Party = c('dem','dem','rep','rep','dem','dem'))
recentCorpus <- corpus(inaugTexts[52:57], docvars=dv)
summary(recentCorpus)
## Corpus consisting of 6 documents.
## 
##          Text Types Tokens Sentences Party
##  1993-Clinton   600   1598        81   dem
##  1997-Clinton   719   2157       112   dem
##     2001-Bush   585   1584        97   rep
##     2005-Bush   725   2071       101   rep
##    2009-Obama   893   2390       112   dem
##    2013-Obama   781   2097        90   dem
## 
## Source:  /Users/ksosulsk/Desktop/ITAUR-master/1_demo/* on x86_64 by ksosulsk
## Created: Mon Oct 19 11:11:43 2015
## Notes:
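
The attached metadata can be retrieved at any point; a quick check, assuming the docvars() accessor:

# retrieve the document variables we attached to the corpus
docvars(recentCorpus)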

We can use this metadata to combine features across documents when creating a document-feature matrix:

partyDfm <- dfm(recentCorpus, groups='Party', ignoredFeatures=(stopwords('english')))
## Creating a dfm from a corpus ...
##    ... grouping texts by variable: Party
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 2 documents
##    ... indexing features: 2,303 feature types
##    ... removed 115 features, from 174 supplied (fixed) feature types
##    ... created a 2 x 2188 sparse dfm
##    ... complete. 
## Elapsed time: 0.03 seconds.
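
Finally, we can visualize the differences between the two party groupings as a comparison word cloud, using the comparison.cloud function from the wordcloud package:
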
wordcloud::comparison.cloud(t(as.matrix(partyDfm)))