Natural Language Processing

Natural language processing (NLP) is a set of techniques for using computers to detect in human language the kinds of things that humans detect automatically. For example, when you read a text, you parse it into paragraphs and sentences. You do not explicitly label the parts of speech, but you certainly understand them. You notice names of people and places as they come up. And you can tell whether a sentence or paragraph is happy or angry or sad.

This kind of analysis is difficult in any programming language, not least because human language can be so rich and subtle that computer languages cannot capture anywhere near the total amount of information “encoded” in it. But when it comes to natural language processing, R programmers have reason to envy Python programmers. The Natural Language Toolkit (NLTK) for Python is a robust library and set of corpora, and the accompanying book Natural Language Processing with Python is an excellent guide to the practice of NLP.1 As explained in the introduction, R and Python are close competitors in many kinds of data analysis and digital history, but if you were going to do only NLP, then the NLTK would be the clear winner.

Nevertheless, R does have good libraries for natural language processing. Because R is able to interface with other languages like C, C++, and Java, it is possible to use libraries written in those lower-level and hence faster languages, while writing your code in R and taking advantage of its functional programming style and its many other libraries for data analysis.2 Indeed, most of the techniques that are most useful for historians, such as word and sentence tokenization, n-gram creation, and named entity recognition, are easily performed in R.3

After explaining how to install the necessary libraries, this chapter will use a sample paragraph to demonstrate a few key techniques. First, we will tokenize the paragraph into words, sentences, and n-grams. (The n-grams will be used in the chapter on document similarity.) Next we will extract the names of people and places from the document. Finally we will use those techniques on the journals or autobiographies of three itinerant preachers from the nineteenth-century United States to see which people they mention in common and which places they visited. For other kinds of text-analysis problems, see the chapters on document similarity and on topic modeling.

Installing the necessary libraries

Several R packages make natural language processing possible in R. The NLP package provides a set of classes and functions for NLP which are used widely by other R packages. The openNLP package provides an interface to the Apache OpenNLP library, which is written in Java. RWeka provides an R interface to the Weka data mining software, also written in Java; it is especially useful for creating n-grams. You may wish to investigate the qdap package, which contains many functions for qualitative discourse analysis. You can see other potentially useful packages at the CRAN task view on natural language processing.4

Both openNLP and RWeka depend on the rJava package, which provides the low-level connection to Java. Before installing these natural language processing packages, you need a working installation of Java and of rJava on your machine. If at all possible, you should use the system version of Java installed by your operating system. You can find out whether you have Java by running which java at your terminal. If Java is present, you can check its version by running java -version. Otherwise you will have to install the Java Development Kit (JDK) on your computer. If you have to install Java, try running R CMD javareconf in your terminal afterwards so that R can find it (you may need to prefix that command with sudo).
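
You can also run these checks from within R. The following minimal sketch (not part of the original instructions) uses two base R functions to locate the Java executable and print its version:

# Locate the java executable on the system path; an empty string means that
# Java was not found.
Sys.which("java")

# Run the same version check as `java -version` at the terminal.
system("java -version")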

On Ubuntu, it is easiest to install rJava from an Ubuntu package using the following commands:

apt-get update
apt-get install r-cran-rjava

Once you have Java on your machine, you can install rJava from CRAN with install.packages("rJava"). If the installation of that package is successful and if it can find your version of Java, then you should be able to load the package without any messages. If you receive an error message when you attempt to load rJava, you may need to set path or environment variables.5

library(rJava)

Once you can successfully load rJava, then you can install the other packages from CRAN.

install.packages(c("NLP", "openNLP", "RWeka", "qdap"))

Note that these NLP packages and the underlying Java libraries depend on large models of human languages. For some uses in English you should not need to download other data or models. But if you want to use a language other than English, or if you want to use named entity extraction as we do below, you can download the openNLP models from a CRAN-like repository called Datacube. You can look in the NLP documentation or use the following command to install the models, substituting a language code for en as appropriate.

install.packages("openNLPmodels.en",
                 repos = "http://datacube.wu.ac.at/",
                 type = "source")

Assuming you have successfully installed everything, you should be able to load the following libraries.

library(NLP)
library(openNLP)
library(RWeka)

Basic Tokenization

To learn some basics of text processing,6 we will work with a paragraph from the biography of Jarena Lee, an African American woman born around 1783 who was an exhorter and itinerant in the African Methodist Episcopal Church.7 Later we will work with the full text of Lee’s Religious Experience and Journal of Mrs. Jarena Lee.

Jarena Lee

We will learn to read the text file into R, to break it into words and sentences, and to turn it into n-grams. Breaking the text into those units of meaning, called tokens, is known as tokenization.

Reading a text file

There are a number of ways to read a text file into R. In particular the scan() function provides many options for reading a file and breaking it into components. But the simplest way to read a plain text file is with the readLines() function, which reads each line of the file as an element of a character vector.

bio <- readLines("data/nlp/anb-jarena-lee.txt")
print(bio)
##  [1] "In 1804, after several months of profound spiritual anxiety, Jarena Lee" 
##  [2] "moved from New Jersey to Philadelphia. There she labored as a domestic"  
##  [3] "and worshiped among white congregations of Roman Catholics and mixed"    
##  [4] "congregations of Methodists. On hearing an inspired sermon by the"       
##  [5] "Reverend Richard Allen, founder of the Bethel African Methodist"         
##  [6] "Episcopal Church, Lee joined the Methodists. She was baptized in 1807."  
##  [7] "Prior to her baptism, she experienced the various physical and emotional"
##  [8] "stages of conversion: terrifying visions of demons and eternal"          
##  [9] "perdition; extreme feelings of ecstasy and depression; protracted"       
## [10] "periods of meditation, fasting, and prayer; ennui and fever; energy and" 
## [11] "vigor. In 1811 she married Joseph Lee, who pastored an African-American" 
## [12] "church in Snow Hill, New Jersey. They had six children, four of whom"    
## [13] "died in infancy."

You can see that there are thirteen lines in the file, each an element of a character vector. We can combine all of these lines into a single character vector of length one using the paste() function, adding a space between each of them.

bio <- paste(bio, collapse = " ")
print(bio)
## [1] "In 1804, after several months of profound spiritual anxiety, Jarena Lee moved from New Jersey to Philadelphia. There she labored as a domestic and worshiped among white congregations of Roman Catholics and mixed congregations of Methodists. On hearing an inspired sermon by the Reverend Richard Allen, founder of the Bethel African Methodist Episcopal Church, Lee joined the Methodists. She was baptized in 1807. Prior to her baptism, she experienced the various physical and emotional stages of conversion: terrifying visions of demons and eternal perdition; extreme feelings of ecstasy and depression; protracted periods of meditation, fasting, and prayer; ennui and fever; energy and vigor. In 1811 she married Joseph Lee, who pastored an African-American church in Snow Hill, New Jersey. They had six children, four of whom died in infancy."

Sentence and Word Annotations

Now that we have the file loaded, we can begin to turn it into words and sentences. This is a prerequisite for any other kind of natural language processing, because those kinds of NLP actions will need to know where the words and sentences are. First we load the necessary libraries.

library(NLP)
library(openNLP)
library(magrittr)

For many kinds of text processing it is sufficient, even preferable, to use base R’s character vectors. But for NLP we are obligated to use the String class from the NLP package, so we need to convert our bio variable to a String.

bio <- as.String(bio)

Next we need to create annotators for words and sentences. These annotator functions are created by constructor functions which load the underlying Java libraries; when applied, the annotators mark the places in the string where words and sentences start and end.

word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()

These annotators form a “pipeline” for annotating the text in our bio variable. First we have to determine where the sentences are, then we can determine where the words are. We can apply these annotator functions to our data using the annotate() function.

bio_annotations <- annotate(bio, list(sent_ann, word_ann))

The result is an annotation object. Looking at the first few items contained in the object, we can see the kind of information it holds.

class(bio_annotations)
## [1] "Annotation" "Span"
head(bio_annotations)
##  id type     start end features
##   1 sentence     1 110 constituents=<<integer,20>>
##   2 sentence   112 240 constituents=<<integer,20>>
##   3 sentence   242 386 constituents=<<integer,25>>
##   4 sentence   388 412 constituents=<<integer,6>>
##   5 sentence   414 693 constituents=<<integer,49>>
##   6 sentence   695 791 constituents=<<integer,19>>

We see that the annotation object contains a list of sentences (and also words) identified by position. That is, the first sentence in the document begins at character 1 and ends at character 110. The sentences also contain information about the positions of the words that comprise them.
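
Because the annotations record character positions, we can use them to index back into the String object. As a quick sketch (not in the original text), the span of the first annotation pulls the first sentence out of bio, and the word annotations can be selected by their type:

# Extract the text of the first sentence using its annotated span
bio[bio_annotations[1]]

# Look at the first few word annotations
head(bio_annotations[bio_annotations$type == "word"])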

We can combine the biography and the annotations to create what the NLP package calls an AnnotatedPlainTextDocument. If we wished, we could also associate metadata with the object using the meta = argument.

bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)

Now we can extract information from our document using accessor functions like sents() to get the sentences and words() to get the words. We could get just the plain text with as.character(bio_doc).

sents(bio_doc) %>% head(2)
## [[1]]
##  [1] "In"           "1804"         ","            "after"       
##  [5] "several"      "months"       "of"           "profound"    
##  [9] "spiritual"    "anxiety"      ","            "Jarena"      
## [13] "Lee"          "moved"        "from"         "New"         
## [17] "Jersey"       "to"           "Philadelphia" "."           
## 
## [[2]]
##  [1] "There"         "she"           "labored"       "as"           
##  [5] "a"             "domestic"      "and"           "worshiped"    
##  [9] "among"         "white"         "congregations" "of"           
## [13] "Roman"         "Catholics"     "and"           "mixed"        
## [17] "congregations" "of"            "Methodists"    "."
words(bio_doc) %>% head(10)
##  [1] "In"        "1804"      ","         "after"     "several"  
##  [6] "months"    "of"        "profound"  "spiritual" "anxiety"

This is already useful, since we could use the resulting lists of sentences and words to perform other kinds of calculations. But there are other kinds of annotations which are more immediately relevant to historians.

Annotating people and places

Among the several kinds of annotators provided by the openNLP package is an entity annotator. An entity is basically a proper noun, such as a person or place name. Using a technique called named entity recognition (NER), we can extract various kinds of names from a document. In English, OpenNLP can find dates, locations, money, organizations, percentages, people, and times. (Acceptable values are "date", "location", "money", "organization", "percentage", "person", "misc".) We will use it to find people, places, and organizations since all three are mentioned in our sample paragraph.

These annotator functions are created using the same kind of constructor functions that we used to create word_ann and sent_ann.

person_ann <- Maxent_Entity_Annotator(kind = "person")
location_ann <- Maxent_Entity_Annotator(kind = "location")
organization_ann <- Maxent_Entity_Annotator(kind = "organization")

Recall that we earlier passed a list of annotator functions to the annotate() function to indicate which kinds of annotations we wanted to make. We will create a new pipeline list to hold our annotators in the order we want to apply them, then apply it to the bio variable. Then, as before, we can create an AnnotatedPlainTextDocument.

pipeline <- list(sent_ann,
                 word_ann,
                 person_ann,
                 location_ann,
                 organization_ann)
bio_annotations <- annotate(bio, pipeline)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)

As before we could extract words and sentences using the getter methods words() and sents(). Unfortunately there is no comparably easy way to extract named entities from documents. But the function below will do the trick.

# Extract entities from an AnnotatedPlainTextDocument
entities <- function(doc, kind) {
  s <- doc$content
  a <- annotations(doc)[[1]]
  if (hasArg(kind)) {
    # Find the "kind" feature of each annotation, then keep only the
    # annotations of the requested kind
    k <- sapply(a$features, `[[`, "kind")
    s[a[k == kind]]
  } else {
    # With no kind specified, return all entity annotations
    s[a[a$type == "entity"]]
  }
}

Now we can extract all of the named entities using entities(bio_doc), and specific kinds of entities using the kind = argument. Let’s get all the people, places, and organizations.

entities(bio_doc, kind = "person")
## [1] "Jarena Lee"    "Richard Allen" "Lee"           "Joseph Lee"
entities(bio_doc, kind = "location")
## [1] "New Jersey"   "Philadelphia" "New Jersey"
entities(bio_doc, kind = "organization")
## [1] "Bethel African Methodist Episcopal Church"

Applying our techniques to this paragraph shows both the power and the limitations of NLP. We managed to extract every person named in the text: Jarena Lee, Richard Allen, and Joseph Lee. But Jarena Lee’s six children were not detected. Both New Jersey and Philadelphia were detected, but “Snow Hill, New Jersey” was not, perhaps because “snow” and “hill” fooled the algorithm into thinking they were common nouns. The Bethel African Methodist Episcopal Church was detected, but not the unnamed African American church in Snow Hill; arguably “Methodists” is also an institution.

Of course we would hardly rely on NLP for short texts such as this. NLP is potentially useful when applied to texts of greater length—especially to more texts than we could read ourselves. In the next section, we will extend NLP to a small corpus of larger texts.

Named Entity Recognition in a small corpus

Now that we know how to extract people and places from a text, we can do the same thing with a small corpus of texts. The code would be identical, or nearly so, for a much larger corpus. For this exercise, we will use three books by itinerant preachers in the nineteenth-century United States:

  • Peter Cartwright, Autobiography of Peter Cartwright, the Backwoods Preacher, edited by W. P. Strickland (New York, 1857).
  • Jarena Lee, Religious Experience and Journal of Mrs. Jarena Lee (Philadelphia, 1849).
  • Parley Parker Pratt, The Autobiography of Parley Parker Pratt (New York, 1874).

These three people were rough contemporaries. Peter Cartwright (1785–1872) was a Methodist; Jarena Lee (1783–?) was a member of the African Methodist Episcopal Church; Parley Parker Pratt (1807–1857) was a Mormon apostle. Their books have been downloaded from Google Books and OCRed. Because they are mostly autobiographical, the places and people they mention were by and large actually places they went and people they met. Named entity recognition is thus likely to produce a picture closer to the actual experience of these people than to their imagined worlds. These texts are by no means perfectly OCRed, but they are representative of the quality of texts that historians often have to work with.

Throughout this exercise we will use R’s functional programming facilities, especially lapply(), to perform the same actions on each text. We could instead copy each and every command for all three texts, changing the names of files and variables as necessary. But this would be a recipe for disaster or, more likely, subtle error. By keeping to the programming principle DRY—don’t repeat yourself—we can avoid unnecessary complications. Just as important, our code will be extensible not just to 3 documents, but to 300 or 3,000.

Before we start we have to load the necessary libraries.

library(NLP)
library(openNLP)
library(magrittr)

Let’s begin by finding the paths to each of the files using R’s Sys.glob() function, which looks for wildcards in file names.

filenames <- Sys.glob("data/itinerants/*.txt")
filenames
## [1] "data/itinerants/cartwright-peter.txt"
## [2] "data/itinerants/lee-jarena.txt"      
## [3] "data/itinerants/pratt-parley.txt"

Now we can use lapply() to apply the readLines() function to each filename. The result will be a list with one item per file. While we are at it, we can use paste0() to combine the lines of each file into a single character vector of length one, and as.String() to convert each one to the format that the NLP packages expect. Finally, we can assign the names of the files to the items in the list.

texts <- filenames %>%
  lapply(readLines) %>%
  lapply(paste0, collapse = " ") %>%
  lapply(as.String)

names(texts) <- basename(filenames)

str(texts, max.level = 1)
## List of 3
##  $ cartwright-peter.txt:Class 'String'  chr "Autobiography of Peter Cartwright  Peter Cartwright, William Peter Strickland     iéarbatb Eihinitp fitbool  1|E.!|t     ANDOVE"| __truncated__
##  $ lee-jarena.txt      :Class 'String'  chr "RELIGIOUS EXPERIENCE  \\JOURNAL  01'  MRS. JARENA 1_;EE,  ctvma  A AN ACCOUNT DRHRLCALL T0 PREACH THE GOSPEL.  Revised and corr"| __truncated__
##  $ pratt-parley.txt    :Class 'String'  chr "THE  AUTOBIOGRAPHY    PARLEY PARKER PRATT,  ONE 0? THE  TWELVE APOSTLES  OP T113  Ehumh nf Jesus Ehrist nf Lartter-flay, Saints"| __truncated__

Next we need to take the steps we used above to annotate a document and abstract them into a function. This will let us use the function with lapply(). It will also permit us to re-use the code in future projects. This function will return an object of class AnnotatedPlainTextDocument.

annotate_entities <- function(doc, annotation_pipeline) {
  annotations <- annotate(doc, annotation_pipeline)
  AnnotatedPlainTextDocument(doc, annotations)
}

Now we can define the pipeline of annotation functions that we are interested in. We will use just the person and location annotators:

itinerants_pipeline <- list(
  Maxent_Sent_Token_Annotator(),
  Maxent_Word_Token_Annotator(),
  Maxent_Entity_Annotator(kind = "person"),
  Maxent_Entity_Annotator(kind = "location")
)

Now we can call our annotate_entities() function on each item in our list. (This function will take a considerable amount of time: a little more than a half hour on my computer.)

# We won't actually run this long-running process here; instead we will load
# the cached results. To compute them yourself, run the pipeline below.
load("data/nlp-cache.rda")
# texts_annotated <- texts %>%
#   lapply(annotate_entities, itinerants_pipeline)

It is now possible to use the entities() function we defined above to extract the relevant information. We could keep everything in a single list object, but to keep it from being unwieldy, we will create one list of the places and another of the people mentioned in each text.

places <- texts_annotated %>%
  lapply(entities, kind = "location")

people <- texts_annotated %>%
  lapply(entities, kind = "person")
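
Before computing any statistics, it is worth a quick look at what was extracted. For example (a minimal check, output not shown here), we could peek at the first few places mentioned in Lee’s journal:

places[["lee-jarena.txt"]] %>% head()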

A few statistics will give us a sense of what we have managed to extract. We can count up the number of items, as well as the number of unique items for each text.

# Total place mentions 
places %>%
  sapply(length)
## cartwright-peter.txt       lee-jarena.txt     pratt-parley.txt 
##                 1219                  358                 1595
# Unique places
places %>%
  lapply(unique) %>%
  sapply(length)
## cartwright-peter.txt       lee-jarena.txt     pratt-parley.txt 
##                  334                  120                  370
# Total mentions of people
people %>%
  sapply(length)
## cartwright-peter.txt       lee-jarena.txt     pratt-parley.txt 
##                  586                  201                 1110
# Unique people mentioned
people %>%
  lapply(unique) %>%
  sapply(length)
## cartwright-peter.txt       lee-jarena.txt     pratt-parley.txt 
##                  317                  125                  503

We could do a lot with this information. We could improve the lists by editing them with our knowledge as historians. In particular, we could geocode the locations and create a map of the world of each itinerant. (See the chapter on mapping.) So far we have created simple lists of the people and places without regard for where they appear in the documents, but the annotations record the exact position of each person and place, and we could use that information for further analysis.

library(ggmap)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate
## 
## 
## Attaching package: 'ggmap'
## 
## The following object is masked from 'package:magrittr':
## 
##     inset
all_places <- union(places[["pratt-parley.txt"]], places[["cartwright-peter.txt"]]) %>%
  union(places[["lee-jarena.txt"]])
# all_places_geocoded <- geocode(all_places)

In this next section we’ll look at how to use some of R’s set functions to figure out the overlap between these three itinerants.


  1. Steven Bird, Ewan Klein, and Edward Loper, Natural Language Processing with Python (Cambridge, MA: O’Reilly, 2009), http://www.nltk.org/book/.

  2. Because the NLTK is written in Python it can be quite slow. R is not exactly a speed demon either, but because most of the NLP libraries that it uses are written in Java, they at least have the potential to match or better the NLTK’s performance.

  3. Natural language processing includes a large number of tasks: the Wikipedia article on NLP lists more than thirty kinds of NLP, though these classifications are somewhat arbitrary. Many of these are not often essential to digital historians—for example, text to speech, part-of-speech tagging, and natural language question answering—or perhaps we should say that they are insufficiently developed to be useful to historians yet. This chapter focuses on a few concrete tasks with ready applications to history.

  4. The NLP package also supports the Stanford CoreNLP suite of tools written in Java. We will use openNLP instead in this chapter. But if you wish to install Stanford CoreNLP, you can install it from the DataCube repository: install.packages("StanfordCoreNLP", repos = "http://datacube.wu.ac.at", type = "source").

  5. Running R CMD javareconf --help may provide some guidance.

  6. You may also find it helpful to review the chapter on working with strings, or character vectors as they are properly called in R.

  7. The paragraph is taken from Lee’s biography in American National Biography Online, s.v. “Lee, Jarena,” by Barbara McCaskill, http://www.anb.org/articles/16/16-03109.html.