Tokenization: splitting raw text into meaningful units (tokens) so that we can perform computations on them.

suppressWarnings(if (!require("pacman"))install.packages("pacman"))
## Loading required package: pacman
pacman::p_load(tidyverse, tidytext, tokenizers, hcandersenr, here)

2.1 What is a token?

In R, text is represented with the character data type, similar to strings in other languages.

Exploring text from fairy tales written by Hans Christian Andersen

# Load required packages
library(tokenizers)
library(tidyverse)
library(tidytext)
library(hcandersenr)

# Narrow to a single book
the_fir_tree <- hcandersen_en %>% 
  filter(book == "The fir tree") %>% 
  pull(text)

the_fir_tree %>% 
  head(n = 9)
## [1] "Far down in the forest, where the warm sun and the fresh air made a sweet"    
## [2] "resting-place, grew a pretty little fir-tree; and yet it was not happy, it"   
## [3] "wished so much to be tall like its companions– the pines and firs which grew" 
## [4] "around it. The sun shone, and the soft air fluttered its leaves, and the"     
## [5] "little peasant children passed by, prattling merrily, but the fir-tree heeded"
## [6] "them not. Sometimes the children would bring a large basket of raspberries or"
## [7] "strawberries, wreathed on a straw, and seat themselves near the fir-tree, and"
## [8] "say, \"Is it not a pretty little tree?\" which made it feel more unhappy than"
## [9] "before."

In tokenization, we take an input (a string) and a token type (e.g. a word) and split the input into pieces (tokens) that correspond to that type. Most commonly, the meaningful unit we want to split text into is a word.

To understand the process of tokenization, let’s start with an overly simple definition for a word: any selection of alphanumeric (letters and numbers) symbols.

base::strsplit() - Split the elements of a character vector x into substrings according to the matches to substring split within them.

the_fir_tree[1:2]
## [1] "Far down in the forest, where the warm sun and the fresh air made a sweet" 
## [2] "resting-place, grew a pretty little fir-tree; and yet it was not happy, it"
strsplit(the_fir_tree[1:2], split = "[^a-zA-Z0-9]+")
## [[1]]
##  [1] "Far"    "down"   "in"     "the"    "forest" "where"  "the"    "warm"  
##  [9] "sun"    "and"    "the"    "fresh"  "air"    "made"   "a"      "sweet" 
## 
## [[2]]
##  [1] "resting" "place"   "grew"    "a"       "pretty"  "little"  "fir"    
##  [8] "tree"    "and"     "yet"     "it"      "was"     "not"     "happy"  
## [15] "it"

At first sight, this result looks pretty decent. However, we have lost all punctuation, which may or may not be helpful for our modeling goal, and the hero of this story ("fir-tree") was split in half.
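One quick tweak, shown only as a sketch and not as the approach taken in these notes: adding the hyphen to the set of characters we keep stops "fir-tree" from being split. The line is copied from the output above; everything else is base R.

```r
# Second line of "The Fir-Tree", copied from the output above
line2 <- "resting-place, grew a pretty little fir-tree; and yet it was not happy, it"

# Split on runs of characters that are NOT letters, digits, or hyphens.
# The hyphen is placed last inside the brackets so it is read literally.
tokens <- strsplit(line2, split = "[^a-zA-Z0-9-]+")
tokens[[1]]
```

Hyphenated words now survive, but punctuation is still dropped and a stray hyphen surrounded by spaces would become a token of its own, which is part of why a dedicated tokenizer is preferable.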

Better way?

Tokenization with tokenizers package

library(tokenizers)
the_fir_tree[1:2]
## [1] "Far down in the forest, where the warm sun and the fresh air made a sweet" 
## [2] "resting-place, grew a pretty little fir-tree; and yet it was not happy, it"
# Basic tokenization into words
tokenize_words(the_fir_tree[1:2])
## [[1]]
##  [1] "far"    "down"   "in"     "the"    "forest" "where"  "the"    "warm"  
##  [9] "sun"    "and"    "the"    "fresh"  "air"    "made"   "a"      "sweet" 
## 
## [[2]]
##  [1] "resting" "place"   "grew"    "a"       "pretty"  "little"  "fir"    
##  [8] "tree"    "and"     "yet"     "it"      "was"     "not"     "happy"  
## [15] "it"

We see sensible single-word results here; the tokenize_words() function uses the stringi package (Gagolewski 2020) and C++ under the hood, making it very fast. It is also considerably more sophisticated than our initial approach of splitting on non-alphanumeric characters.

2.2 Types of tokens

Thinking of a token as a word is a useful way to start understanding tokenization, even if it is hard to implement concretely in software. We can generalize the idea of a token beyond only a single word to other units of text. We can tokenize text at a variety of units including:

  • characters,

  • words,

  • sentences,

  • lines,

  • paragraphs, and

  • n-grams.

In the following sections, we will explore how to tokenize text using the tokenizers package. These functions take a character vector as the input and return lists of character vectors as output. This same tokenization can also be done using the tidytext (Silge and Robinson 2016) package, for workflows using tidy data principles where the input and output are both in a dataframe.

Let’s see this in action:

# Character vector
sample_vector <- c("Far down in the forest",
                   "grew a pretty little fir-tree")

sample_vector
## [1] "Far down in the forest"        "grew a pretty little fir-tree"
# Tibble with one text column
sample_tibble <- tibble(text = sample_vector)

sample_tibble

Tokenization achieved by tokenizers::tokenize_words() on character vector:

tokenize_words(sample_vector)
## [[1]]
## [1] "far"    "down"   "in"     "the"    "forest"
## 
## [[2]]
## [1] "grew"   "a"      "pretty" "little" "fir"    "tree"

will yield the same results as using tidytext::unnest_tokens() on sample_tibble; the only difference is the data structure, and thus how we might use the result moving forward in our analysis.

unnest_tokens(tbl = sample_tibble, output = word, input = text, token = "words")
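To make the difference in data structure concrete, here is a minimal base-R sketch that reshapes the list output shown above into a one-token-per-row data frame, which is roughly the shape unnest_tokens() returns. The tokens are copied from the output above; the `line` column is a hypothetical addition for this sketch and is not something unnest_tokens() creates.

```r
# Tokens copied from the tokenize_words(sample_vector) output above
tokens <- list(
  c("far", "down", "in", "the", "forest"),
  c("grew", "a", "pretty", "little", "fir", "tree")
)

# One row per token; `line` records which input element each token came from
# (`line` is a hypothetical column name chosen for this sketch)
token_df <- data.frame(
  line = rep(seq_along(tokens), lengths(tokens)),
  word = unlist(tokens)
)
token_df
```

The list keeps tokens grouped by input element, while the data frame is long and flat, which suits dplyr verbs such as count() and group_by().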

Arguments used in tokenizers::tokenize_words(), e.g. strip_punct, can be passed through tidytext::unnest_tokens() using “the dots”, ....

# Passing arguments from the tokenizers package through tidytext
sample_tibble %>% 
  unnest_tokens(output = words, input = text, token = "words", strip_punct = FALSE)

Tidytext - More Tidyverse oriented?

2.2.1 Character tokens

The simplest tokenization splits text into individual characters.

Let’s use tokenize_characters() with its default parameters; this function has arguments to convert to lowercase and to strip all non-alphanumeric characters. These defaults will reduce the number of different tokens that are returned.

# Split text into characters
tft_token_characters <- tokenize_characters(
  x = the_fir_tree,
  lowercase = TRUE,
  strip_non_alphanum = TRUE,
  simplify = FALSE
)

head(tft_token_characters) %>% 
  glimpse()
## List of 6
##  $ : chr [1:57] "f" "a" "r" "d" ...
##  $ : chr [1:57] "r" "e" "s" "t" ...
##  $ : chr [1:61] "w" "i" "s" "h" ...
##  $ : chr [1:56] "a" "r" "o" "u" ...
##  $ : chr [1:64] "l" "i" "t" "t" ...
##  $ : chr [1:64] "t" "h" "e" "m" ...

We don’t have to stick with the defaults. We can keep the punctuation and spaces by setting strip_non_alphanum = FALSE.

# Keep punctuation and white spaces
the_fir_tree %>% 
  tokenize_characters(strip_non_alphanum = F) %>% 
  head() %>% 
  glimpse()
## List of 6
##  $ : chr [1:73] "f" "a" "r" " " ...
##  $ : chr [1:74] "r" "e" "s" "t" ...
##  $ : chr [1:76] "w" "i" "s" "h" ...
##  $ : chr [1:72] "a" "r" "o" "u" ...
##  $ : chr [1:77] "l" "i" "t" "t" ...
##  $ : chr [1:77] "t" "h" "e" "m" ...

Depending on the format you have your text data in, it might contain ligatures. Ligatures are when multiple graphemes or letters are combined as a single character. The graphemes “f” and “l” are combined into “ﬂ,” or “s” and “t” into “ﬆ.” When we apply normal tokenization rules the ligatures will not be split up.
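A small sketch of what this means in practice, assuming the stringi package is installed. The example word is made up, and Unicode NFKC normalization is one possible way to break ligatures apart, not necessarily the one the source material recommends.

```r
library(stringi)

# "\ufb02" is the single-character ligature for "fl"
ligature_word <- "\ufb02owers"
nchar(ligature_word)            # 6: the ligature counts as one character

# NFKC normalization replaces compatibility characters such as ligatures
# with their plain-letter equivalents
plain_word <- stri_trans_nfkc(ligature_word)
plain_word
nchar(plain_word)               # 7: "f" and "l" are now separate characters
```

Normalizing before tokenizing means a character tokenizer sees "f" and "l" as two tokens rather than one unusual token.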

2.2.2 Word tokens

Tokenizing at the word level is perhaps the most common and widely used tokenization.

# Word tokenization
tft_token_words <- tokenize_words(
  x = the_fir_tree,
  lowercase = T,
  stopwords = NULL,
  strip_punct = T,
  strip_numeric = F
)

# Results
tft_token_words %>% 
  head() %>% 
  glimpse()
## List of 6
##  $ : chr [1:16] "far" "down" "in" "the" ...
##  $ : chr [1:15] "resting" "place" "grew" "a" ...
##  $ : chr [1:15] "wished" "so" "much" "to" ...
##  $ : chr [1:14] "around" "it" "the" "sun" ...
##  $ : chr [1:12] "little" "peasant" "children" "passed" ...
##  $ : chr [1:13] "them" "not" "sometimes" "the" ...

Let’s create a tibble with two fairy tales, “The Fir-Tree” and “The Little Mermaid.” Then we can use unnest_tokens() together with some dplyr verbs to find the most commonly used words in each.

# A dance with data
hcandersen_en %>% 
  filter(book %in% c("The fir tree", "The little mermaid")) %>% 
  unnest_tokens(output = word, input = text, token = "words") %>% 
  group_by(book) %>% 
  count(word) %>% 
  arrange(desc(n)) %>% 
  slice(1:5)

The five most common words in each fairy tale are fairly uninformative, with the exception being "tree" in “The Fir-Tree.”

These uninformative words are called stop words.
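The same effect is visible even on a single line. The sketch below uses only base R and the crude split-on-non-alphanumerics tokenizer from earlier, so it is an illustration rather than a replacement for the unnest_tokens() pipeline above.

```r
# First line of "The Fir-Tree", copied from the output at the top of these notes
line1 <- "Far down in the forest, where the warm sun and the fresh air made a sweet"

# Crude word tokenization: lowercase, then split on non-alphanumeric characters
words <- strsplit(tolower(line1), "[^a-z0-9]+")[[1]]

# Count each word and sort from most to least frequent
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts, 3)
```

The most frequent word is "the", which says nothing about the content of the story.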

2.2.3 Tokenize by n-grams

An n-gram can be defined loosely as a contiguous sequence of n items from a given sequence of text or speech, e.g. a group of n words.

Some example n-grams are:

  • unigram: “Hello,” “day,” “my,” “little”

  • bigram: “fir tree,” “fresh air,” “to be,” “Robin Hood”

  • trigram: “You and I,” “please let go,” “no time like,” “the little mermaid”

The benefit of using n-grams compared to words is that n-grams capture word order that would otherwise be lost. Similarly, when we use character n-grams, we can model the beginning and end of words.

# n-gram tokenization
tft_token_ngram <- tokenize_ngrams(
  x = the_fir_tree,
  lowercase = TRUE,
  # Number of words in the n-gram (the upper bound when n_min < n)
  n = 3L,
  # Minimum number of words in the n-gram
  n_min = 3L,
  stopwords = character(),
  # Separator between words in an n-gram
  ngram_delim = " ",
  simplify = FALSE
)

# First line
the_fir_tree[1]
## [1] "Far down in the forest, where the warm sun and the fresh air made a sweet"
# Ngram for first line
tft_token_ngram[[1]]
##  [1] "far down in"      "down in the"      "in the forest"    "the forest where"
##  [5] "forest where the" "where the warm"   "the warm sun"     "warm sun and"    
##  [9] "sun and the"      "and the fresh"    "the fresh air"    "fresh air made"  
## [13] "air made a"       "made a sweet"
  • N-gram tokenization slides along the text to create overlapping sets of tokens.

  • It is important to choose the right value for n (the number of words in each n-gram) for the question we want to answer. Unigrams are faster and more efficient, but they don’t capture information about word order. A higher value of n keeps more information, but the vector space of tokens increases dramatically, which means each distinct token is seen fewer times.

  • A sensible starting point is n = 3, but n = 2 can work if you don’t have a large vocabulary in your data set.

  • Combining different degrees of n-grams can allow you to extract different levels of detail from text data. Unigrams tell you which individual words have been used a lot of times; some of these words could be overlooked in bigram or trigram counts if they don’t co-appear with other words often.

  • Unigrams alone capture no information about word order.

# Different degrees of n-grams
tft_token_ngram2 <- tokenize_ngrams(
  x = the_fir_tree,
  n = 2L,
  n_min = 1
)

# Display
the_fir_tree[1]
## [1] "Far down in the forest, where the warm sun and the fresh air made a sweet"
tft_token_ngram2[[1]]
##  [1] "far"          "far down"     "down"         "down in"      "in"          
##  [6] "in the"       "the"          "the forest"   "forest"       "forest where"
## [11] "where"        "where the"    "the"          "the warm"     "warm"        
## [16] "warm sun"     "sun"          "sun and"      "and"          "and the"     
## [21] "the"          "the fresh"    "fresh"        "fresh air"    "air"         
## [26] "air made"     "made"         "made a"       "a"            "a sweet"     
## [31] "sweet"
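The sliding-window idea behind word n-grams can be sketched in a few lines of base R. The helper word_ngrams() is hypothetical and written only for illustration; it is not how the tokenizers package implements tokenize_ngrams().

```r
# Hypothetical helper (not part of tokenizers): build word n-grams by sliding
# a window of width n over a vector of words, one position at a time.
word_ngrams <- function(words, n) {
  # Fewer than n words means no complete window fits
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

words <- c("far", "down", "in", "the", "forest")
word_ngrams(words, n = 3)
```

A line of k words yields k - n + 1 n-grams, which agrees with the 14 trigrams produced above from the 16-word first line.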

2.2.4 Lines, sentences and paragraph tokens

Tokenizers to split text into larger units of text like lines, sentences, and paragraphs are rarely used directly for modeling purposes, as the tokens produced tend to be fairly unique. It is very uncommon for multiple sentences in a text to be identical!

However, these tokenizers are useful for preprocessing and labeling.

For example, Jane Austen’s novel Northanger Abbey (as available in the janeaustenr package) is already preprocessed with each line being at most 80 characters long. However, it might be useful to split the data into chapters and paragraphs instead.

  • Let’s create a function that takes a dataframe containing a variable called text and turns it into a dataframe where the text is transformed into paragraphs.

  • First, we can collapse the text into one long string using collapse = "\n" to denote line breaks,

  • and then next we can use tokenize_paragraphs() to identify the paragraphs and put them back into a dataframe. We can add a paragraph count with row_number().

paste0(…, collapse) is equivalent to paste(…, sep = "", collapse), but slightly more efficient.

If a value is specified for collapse, the values in the result are then concatenated into a single string, with the elements being separated by the value of collapse.
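A minimal sketch of why collapsing with "\n" helps. The lines are made up, and strsplit() is used as a base-R stand-in for tokenize_paragraphs(), on the assumption that the latter breaks on "\n\n" by default.

```r
# Made-up lines for illustration; the empty string stands for a blank line
text_lines <- c("First paragraph, line one.",
                "First paragraph, line two.",
                "",
                "Second paragraph.")

# Join all lines into one string, marking each line break with "\n"
one_string <- paste(text_lines, collapse = "\n")

# A blank line has become "\n\n". Splitting there is a base-R stand-in for
# tokenize_paragraphs(), assumed here to break on "\n\n" by default.
paragraphs <- strsplit(one_string, "\n\n", fixed = TRUE)[[1]]
paragraphs
```

Because the blank line separating paragraphs turns into two consecutive newlines, paragraph boundaries remain recoverable after the lines are glued together.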

# Return paragraphs function
add_paragraphs <- function(data){
  data %>% pull(text) %>% 
    # collapse output into a single string 
    paste(collapse = "\n") %>% 
    tokenize_paragraphs() %>% 
    # Flatten to produce a vector
    unlist() %>% 
    tibble(text = .) %>% 
    mutate(paragraph = row_number())
  
}

Now we take the raw text data and add the chapter count by detecting when the characters "CHAPTER" appear at the beginning of a line. Then we nest() the text column, apply our add_paragraphs() function, and then unnest() again.

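nest(): packs the selected columns into a list-column of data frames, leaving one row for each unique combination of the non-nested columns.

map() and its variants: similar to lapply() and friends in base R; worth further reading.

In filter(), conditions can be combined with & or separated by commas; both mean logical AND.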
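The chapter-counting trick is easiest to see on a toy example. The sketch below uses base R (grepl() in place of str_detect()) and made-up lines, so it only illustrates the idea behind the pipeline that follows.

```r
# Made-up lines standing in for the raw novel text
text <- c("TITLE PAGE", "CHAPTER 1", "first line", "second line",
          "CHAPTER 2", "third line")

# TRUE on heading lines; cumsum() turns that into a running chapter number
is_heading <- grepl("^CHAPTER ", text)
chapter <- cumsum(is_heading)
chapter

# Keep body lines only: drop front matter (chapter 0) and the headings
keep <- chapter > 0 & !is_heading
data.frame(chapter = chapter[keep], text = text[keep])
```

Every line inherits the number of the most recent heading, so filtering on chapter > 0 removes everything before the first chapter.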

# Make paragraphs
library(janeaustenr)

northangerabbey_paragraphed <- tibble(text = northangerabbey) %>% 
       mutate(chapter = cumsum(text %>% str_detect("^CHAPTER "))) %>%
       filter(chapter > 0, !str_detect(text, "^CHAPTER ")) %>% 
       nest(data = text) %>%
       mutate(data = map(data, add_paragraphs)) %>% 
       unnest(cols = c(data))

glimpse(northangerabbey_paragraphed)
## Rows: 1,020
## Columns: 3
## $ chapter   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, ~
## $ text      <chr> "No one who had ever seen Catherine Morland in her infancy w~
## $ paragraph <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~

It can be useful to be able to reshape text data to get a different observational unit. As an example, if you wanted to build a sentiment classifier that would classify sentences as hostile or not, then you need to work with and train your model on sentences of text. Turning pages or paragraphs into sentences is a necessary step in your workflow.

Let us look at how we can turn the_fir_tree from a “one line per element” vector to a “one sentence per element.” the_fir_tree comes as a vector so we start by using paste() to combine the lines back together. We use a space as the separator, and then we pass it to the tokenize_sentences() function from the tokenizers package, which will perform sentence splitting.

# One sentence per element
the_fir_tree_sentences <- the_fir_tree %>% 
  paste(collapse = " ") %>% 
  tokenize_sentences()

the_fir_tree_sentences[[1]] %>% 
  head()
## [1] "Far down in the forest, where the warm sun and the fresh air made a sweet resting-place, grew a pretty little fir-tree; and yet it was not happy, it wished so much to be tall like its companions– the pines and firs which grew around it."
## [2] "The sun shone, and the soft air fluttered its leaves, and the little peasant children passed by, prattling merrily, but the fir-tree heeded them not."                                                                                       
## [3] "Sometimes the children would bring a large basket of raspberries or strawberries, wreathed on a straw, and seat themselves near the fir-tree, and say, \"Is it not a pretty little tree?\""                                                  
## [4] "which made it feel more unhappy than before."                                                                                                                                                                                                
## [5] "And yet all this while the tree grew a notch or joint taller every year; for by the number of joints in the stem of a fir-tree we can discover its age."                                                                                     
## [6] "Still, as it grew, it complained."

If you have lines from different categories as we have in the hcandersen_en dataframe, which contains all the lines of the fairy tales in English, then we would like to be able to turn these lines into sentences while preserving the book column in the data set. To do this we use nest() and map_chr() to create a dataframe where each fairy tale is its own element and then we use the unnest_sentences() function from the tidytext package to split the text into sentences.

Note: we used tidytext::unnest_sentences() rather than tokenizers::tokenize_sentences() because the data is a tibble.

# Split text into sentences
hcandersen_sentences <- hcandersen_en %>%
       nest(data = c(text)) %>% 
       mutate(data = map_chr(data, ~paste(.x$text, collapse = " "))) %>% 
       unnest_sentences(input = data, output = sentences)

Now that we have turned the text into “one sentence per element,” we can analyze on the sentence level.

2.3 Where does tokenization break down?

Tokenization will generally be one of the first steps when building a model or any kind of text analysis, so it is important to consider carefully what happens in this step of data preprocessing. As with most software, there is a trade-off between speed and customizability, as demonstrated in Section 2.6. The fastest tokenization methods give us less control over how it is done.

While the defaults work well in many cases, we encounter situations where we want to impose stricter rules to get better or different tokenized results.

2.4 Build your own tokenizer

  • Sometimes the out-of-the-box tokenizers won’t be able to do what you need them to do. In this case, we will have to wield stringi/stringr and regular expressions.

  • There are two main approaches to tokenization.

    1. Split the string up according to some rule.

    2. Extract tokens based on some rule.

  • We can reach complex outcomes by chaining together many smaller rules.
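The two approaches can be compared side by side in base R. This is a sketch on a made-up input; the sections below use stringr, where str_split() and str_extract_all() play the same roles.

```r
x <- "Far down in the forest"

# Approach 1: split the string wherever the rule (white space) matches
split_tokens <- strsplit(x, "[[:space:]]+")[[1]]

# Approach 2: extract every substring that matches the rule (runs of letters)
extract_tokens <- regmatches(x, gregexpr("[[:alpha:]]+", x))[[1]]

identical(split_tokens, extract_tokens)
```

On clean input both give the same tokens. They diverge once punctuation appears: splitting keeps whatever is attached to a word, while extraction keeps only what the pattern describes.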

2.4.1 Tokenize to characters, only keeping letters

Here we want to modify what tokenize_characters() does, such that we only keep letters. There are two main options. We can use tokenize_characters() and remove anything that is not a letter, or we can extract the letters one by one. Let’s try the latter option. This is an extract task, and we will use str_extract_all() as each string has the possibility of including more than one token. Since we want to extract letters we can use the letters character class [:alpha:] to match letters and the quantifier {1} to only extract the first one.

Regex references: the stringi-search-regex help topic in the stringi package, and ?regex in base R.

[pattern] - Match any one character from the set.

{n} - Match exactly n times

regular expressions list: https://smltar.com/regexp.html#tab:characterclasses

string <- "This sentence includes 2 numbers and 1 period."

str_extract_all(string = string,
            pattern = "[:alpha:]{1}")
## [[1]]
##  [1] "T" "h" "i" "s" "s" "e" "n" "t" "e" "n" "c" "e" "i" "n" "c" "l" "u" "d" "e"
## [20] "s" "n" "u" "m" "b" "e" "r" "s" "a" "n" "d" "p" "e" "r" "i" "o" "d"
  • We may be tempted to specify the character class as something like [a-zA-Z]{1}. This option would run faster, but we would lose non-English letter characters.
danish_sentence <- "Så mødte han en gammel heks på landevejen"

str_extract_all(danish_sentence, "[:alpha:]{1}")
## [[1]]
##  [1] "S" "å" "m" "ø" "d" "t" "e" "h" "a" "n" "e" "n" "g" "a" "m" "m" "e" "l" "h"
## [20] "e" "k" "s" "p" "å" "l" "a" "n" "d" "e" "v" "e" "j" "e" "n"
str_extract_all(danish_sentence, "[a-zA-Z]{1}")
## [[1]]
##  [1] "S" "m" "d" "t" "e" "h" "a" "n" "e" "n" "g" "a" "m" "m" "e" "l" "h" "e" "k"
## [20] "s" "p" "l" "a" "n" "d" "e" "v" "e" "j" "e" "n"

The choice between [:alpha:] and [a-zA-Z] may seem minor, but the resulting differences can have a big impact on your analysis.
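A related pitfall when writing ASCII ranges by hand: [A-z] is not the same as [a-zA-Z], because several punctuation characters sit between "Z" and "a" in ASCII order. The sketch below uses base R with perl = TRUE so that ranges follow code-point order; the input string is made up.

```r
x <- "snake_case^2"

# perl = TRUE makes the ranges follow ASCII code-point order
letters_only <- regmatches(x, gregexpr("[a-zA-Z]", x, perl = TRUE))[[1]]
too_wide     <- regmatches(x, gregexpr("[A-z]", x, perl = TRUE))[[1]]

letters_only
too_wide
```

The second pattern also returns "_" and "^", so a one-character slip in the range quietly lets punctuation through as "letters".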

2.4.2 Allow for hyphenated words

In our examples so far, we have noticed that the string “fir-tree” is typically split into two tokens. Let’s explore two different approaches for how to handle this hyphenated word as one token.

  • First, let’s split on white space (space, tab, vertical tab, newline, form feed, carriage return); this is a decent way to identify words in English and some other languages, and it does not split hyphenated words as the hyphen character isn’t considered white space.

  • Second, let’s find a regex to match words with a hyphen and extract those.

Splitting by white space is not too difficult because we can use character classes, as shown in Table A.2. We will use the white space character class [:space:] to split our sentence.

str_split(string = "This isn't a sentence with hyphenated-words..",
          pattern = "[:space:]")
## [[1]]
## [1] "This"               "isn't"              "a"                 
## [4] "sentence"           "with"               "hyphenated-words.."

This worked pretty well. This version doesn’t drop punctuation, but we can achieve this by removing punctuation characters at the beginning and end of words.

str_split("This isn't a sentence with hyphenated-words.",
          pattern = "[:space:]") %>% 
  map(~ str_remove_all(.x, "^[:punct:]+|[:punct:]+$"))
## [[1]]
## [1] "This"             "isn't"            "a"                "sentence"        
## [5] "with"             "hyphenated-words"

This regex used to remove the punctuation is a little complicated, so let’s discuss it piece by piece.

  • The regex ^[:punct:]+ will look at the beginning of the string (^) to match any punctuation characters ([:punct:]), where it will select one or more (+).

  • The other regex [:punct:]+$ will look for punctuation characters ([:punct:]) that appear one or more times (+) at the end of the string ($).

  • These will alternate (|) so that we get matches from both sides of the words.

  • The reason we use the quantifier + is that there are cases where a word is followed by multiple characters we don’t want, such as "okay..." and "Really?!!!".

  • We can specify how many times we expect something to occur using quantifiers, e.g. + (one or more times).
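The trimming regex described in the bullets above can be tried on a few awkward tokens. This sketch uses base R's gsub(), where the class is written [[:punct:]] rather than stringr's [:punct:]; the test tokens are made up apart from those quoted in the bullets.

```r
words <- c("okay...", "Really?!!!", "\"Is", "isn't", "fir-tree;")

# Remove one or more punctuation characters at the start OR at the end
trimmed <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", words)
trimmed
```

Punctuation inside a token, such as the apostrophe in "isn't" and the hyphen in "fir-tree", is untouched because neither anchor reaches it.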

Now let’s see if we can get the same result using extraction. We will start by constructing a regular expression that will capture hyphenated words; our definition here is a word with one hyphen located inside it. Since we want the hyphen to be inside the word, we will need to have a non-zero number of characters on either side of the hyphen.

str_extract_all(
  string = "This isn't a sentence with hyphenated-words.",
  pattern = "[:alpha:]+-[:alpha:]+")
## [[1]]
## [1] "hyphenated-words"

[:alpha:] matches a single letter.

[:alpha:]+ matches a run of one or more letters, i.e. a word.

[:alpha:]+[:alpha:]+ is two consecutive runs of letters and so needs at least two letters in total; one-letter words like "a" cannot match.

Wait, this only matched the hyphenated word! This happened because we are only matching words with hyphens. If we add the quantifier ? then we can match 0 or 1 occurrences.

str_extract_all(
  string = "This isn't a sentence with hyphenated-words.",
  pattern = "[:alpha:]+-?[:alpha:]+")
## [[1]]
## [1] "This"             "isn"              "sentence"         "with"            
## [5] "hyphenated-words"

Now we are getting more words, but the ending of "isn't" is not there anymore and we lost the word "a".

  • We can get matches for the whole contraction by expanding the character class [:alpha:] to include the character '. We do that by using [[:alpha:]'].
# Eric's approach
str_extract_all(
  string = "This isn't a sentence with hyphenated-words.",
  pattern = "[[:alpha:]'-]+")
## [[1]]
## [1] "This"             "isn't"            "a"                "sentence"        
## [5] "with"             "hyphenated-words"
# Emil's approach
str_extract_all(
  string = "This isn't a sentence with hyphenated-words.",
  pattern = "[[:alpha:]']+-?[[:alpha:]']+")
## [[1]]
## [1] "This"             "isn't"            "sentence"         "with"            
## [5] "hyphenated-words"

Next, we need to find out why "a" wasn’t matched. If we look at the regular expression, we remember that we imposed the restriction that a non-zero number of characters needed to surround the hyphen to avoid matching words that start or end with a hyphen. This means that the smallest possible pattern matched is two characters long. We can fix this by using an alternation with |. We will keep our previous match on the left-hand side, and include [:alpha:]{1} on the right-hand side to match the single length words that won’t be picked up by the left-hand side. Notice how we aren’t using [[:alpha:]'] since we are not interested in matching single ' characters.

str_extract_all(
  string = "This isn't a sentence with hyphenated-words.",
  pattern = "[[:alpha:]']+-?[[:alpha:]']+|[:alpha:]{1}"
)
## [[1]]
## [1] "This"             "isn't"            "a"                "sentence"        
## [5] "with"             "hyphenated-words"

2.4.3 Wrapping it in a function

We have shown how we can use regular expressions to extract the tokens we want, perhaps to use in modeling. So far, the code has been rather unstructured. We would ideally wrap these tasks into functions that can be used the same way tokenize_words() is used.

Let’s start with the example with hyphenated words. To make the function a little more flexible, let’s add an option to transform all the output to lowercase.

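# Tokenize into words, keeping hyphenated words as single tokens
tokenize_hyphenated_words <- function(x, lowercase = TRUE) {
  # Optionally lowercase the input before splitting
  if (lowercase) x <- str_to_lower(x)

  # Split on white space, then strip punctuation from both ends of each token
  str_split(x, "[:space:]") %>%
    map(~ str_remove_all(.x, "^[:punct:]+|[:punct:]+$"))
}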


# Book extract
the_fir_tree[1:3]
## [1] "Far down in the forest, where the warm sun and the fresh air made a sweet"   
## [2] "resting-place, grew a pretty little fir-tree; and yet it was not happy, it"  
## [3] "wished so much to be tall like its companions– the pines and firs which grew"
# Call function
tokenize_hyphenated_words(x = the_fir_tree[1:3])
## [[1]]
##  [1] "far"    "down"   "in"     "the"    "forest" "where"  "the"    "warm"  
##  [9] "sun"    "and"    "the"    "fresh"  "air"    "made"   "a"      "sweet" 
## 
## [[2]]
##  [1] "resting-place" "grew"          "a"             "pretty"       
##  [5] "little"        "fir-tree"      "and"           "yet"          
##  [9] "it"            "was"           "not"           "happy"        
## [13] "it"           
## 
## [[3]]
##  [1] "wished"     "so"         "much"       "to"         "be"        
##  [6] "tall"       "like"       "its"        "companions" "the"       
## [11] "pines"      "and"        "firs"       "which"      "grew"


Notice how we transformed to lowercase first because the rest of the operations are case insensitive.
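For comparison, the extraction regex from Section 2.4.2 can be wrapped in the same way. The function name tokenize_hyphenated_extract() is hypothetical, and the sketch uses base R with perl = TRUE instead of stringr, which may treat only ASCII letters as [[:alpha:]]; treat it as an illustration under those assumptions.

```r
# Hypothetical extraction-based counterpart to tokenize_hyphenated_words()
tokenize_hyphenated_extract <- function(x, lowercase = TRUE) {
  if (lowercase) x <- tolower(x)

  # Left side: words of two or more characters with an optional inner hyphen;
  # right side: single-letter words such as "a"
  pattern <- "[[:alpha:]']+-?[[:alpha:]']+|[[:alpha:]]"
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

result <- tokenize_hyphenated_extract("This isn't a sentence with hyphenated-words.")
result[[1]]
```

Like tokenize_words(), it takes a character vector and returns a list of character vectors, so the two can be swapped in the same workflow.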

Next let’s turn our character n-gram tokenizer into a function, with a variable n argument.

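What is going on below: the regex (?=(\\w{n})) is a zero-width lookahead, so str_locate_all() returns every position that is followed by n word characters, including overlapping ones. str_sub() then extracts the n characters starting at each of those positions.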

tokenize_character_ngram <- function(x, n) {
  ngram_loc <- str_locate_all(x, paste0("(?=(\\w{", n, "}))"))

  map2(ngram_loc, x, ~str_sub(.y, .x[, 1], .x[, 1] + n - 1))
}

tokenize_character_ngram(the_fir_tree[1:3], n = 3)
## [[1]]
##  [1] "Far" "dow" "own" "the" "for" "ore" "res" "est" "whe" "her" "ere" "the"
## [13] "war" "arm" "sun" "and" "the" "fre" "res" "esh" "air" "mad" "ade" "swe"
## [25] "wee" "eet"
## 
## [[2]]
##  [1] "res" "est" "sti" "tin" "ing" "pla" "lac" "ace" "gre" "rew" "pre" "ret"
## [13] "ett" "tty" "lit" "itt" "ttl" "tle" "fir" "tre" "ree" "and" "yet" "was"
## [25] "not" "hap" "app" "ppy"
## 
## [[3]]
##  [1] "wis" "ish" "she" "hed" "muc" "uch" "tal" "all" "lik" "ike" "its" "com"
## [13] "omp" "mpa" "pan" "ani" "nio" "ion" "ons" "the" "pin" "ine" "nes" "and"
## [25] "fir" "irs" "whi" "hic" "ich" "gre" "rew"

We can use paste0() in this function to construct an actual regex.
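To see what paste0() actually builds and how the lookahead finds overlapping matches, here is a base-R sketch of the same idea. It uses gregexpr() with perl = TRUE instead of str_locate_all(), which is an assumption made here to keep the example free of package dependencies.

```r
n <- 3

# paste0() glues the pieces into one pattern string
pattern <- paste0("(?=(\\w{", n, "}))")
cat(pattern, "\n")   # prints the regex as the engine sees it: (?=(\w{3}))

x <- "fir tree"

# The lookahead matches zero characters, so every start position that is
# followed by n word characters is reported, even when the windows overlap
starts <- as.integer(gregexpr(pattern, x, perl = TRUE)[[1]])
starts <- starts[starts > 0]     # gregexpr() returns -1 when nothing matches

trigrams <- substring(x, starts, starts + n - 1)
trigrams
```

Windows that would include the space are skipped, because \w does not match white space; that is why character n-grams here never straddle two words.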

2.5 Tokenization for non-Latin alphabets

Our discussion of tokenization so far has focused on text where words are separated by white space and punctuation. For such text, even a quite basic tokenizer can give decent results. However, many written languages don’t separate words in this way.

One of these languages is Chinese where each “word” can be represented by one or more consecutive characters. Splitting Chinese text into words is called “word segmentation” and is still an active area of research (Ma, Ganchev, and Weiss 2018; Huang et al. 2020).

We are not going to go into depth in this area, but we want to showcase that word segmentation is indeed possible with R as well. We use the jiebaR package (Wenfeng and Yanyi 2019). It is conceptually similar to the tokenizers package, but we need to create a worker that is passed into segment() along with the string we want to segment.

library(jiebaR)
## Loading required package: jiebaRD
words <- c("下面是不分行输出的结果", "下面是不输出的结果")

engine1 <- worker(bylines = T)

segment(words, engine1)
## [[1]]
##  [1] "U"    "4"    "E0B"  "U"    "9762" "U"    "662"  "F"    "U"    "4"   
## [11] "E0D"  "U"    "5206" "U"    "884"  "C"    "U"    "8"    "F93"  "U"   
## [21] "51"   "FA"   "U"    "7684" "U"    "7"    "ED3"  "U"    "679"  "C"   
## 
## [[2]]
##  [1] "U"    "4"    "E0B"  "U"    "9762" "U"    "662"  "F"    "U"    "4"   
## [11] "E0D"  "U"    "8"    "F93"  "U"    "51"   "FA"   "U"    "7684" "U"   
## [21] "7"    "ED3"  "U"    "679"  "C"

Note that the output above is not a real segmentation. The pieces spell out <U+4E0B>, <U+9762>, <U+662F> and so on, which are the escaped code points of the input characters. This suggests the Chinese text was converted to escapes before it reached segment(), an encoding problem that typically arises in R sessions that do not use UTF-8. In a UTF-8 session the result should be Chinese words rather than code-point fragments.

2.7 Summary

  • To build a predictive model, text data needs to be split into meaningful units, called tokens.

  • Fast and consistent tokenizers are available, but understanding how they behave and in what circumstances they work best will set you up for success.

  • Once text data is tokenized, a common next preprocessing step is to consider how to handle very common words that are not very informative: stop words.

