Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that focuses on the interaction between computers and human language. It encompasses the methods and techniques used to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful. NLP plays a crucial role in bridging the gap between human communication and computer systems, enabling machines to process and analyze vast amounts of textual data.
One of the key applications of NLP is text prediction, where algorithms generate probable next words or phrases based on the context of the input text. Text prediction relies on NLP techniques such as language modeling, statistical analysis, and machine learning. By training models on large amounts of text data, NLP systems can accurately predict the most likely next words or phrases, leading to improved writing assistance and productivity tools.
Text prediction has a wide range of use cases across different domains. In the context of email communication, NLP-based text prediction can suggest relevant phrases, complete sentences, or even compose entire email responses, saving time and effort for users. In addition, NLP-powered text prediction is widely used in search engines and autocomplete features, providing users with suggested search queries or completing their search queries based on historical data and patterns.
Text prediction is also valuable in the context of writing and content creation. Content creators, including authors, journalists, and bloggers, can benefit from NLP-powered text prediction to generate suggestions for the next word, sentence, or paragraph, aiding in the creative process and improving overall writing efficiency. Moreover, in the field of customer support, chatbots and virtual assistants employ text prediction to generate automated responses that closely match customer queries, enhancing customer service and reducing response time.
This report details the building of a next-word text prediction model using source data provided by the company SwiftKey. The source data comprises text from news sites, blogs, and Twitter posts.
The model will be built using NLP techniques in R, following the steps outlined in the sections below.
This work will be completed within the next seven weeks.
Data is loaded from the US English versions of the files.
library(readr)

# Load each source file only if it is not already in memory.
# The files are opened in binary mode ("rb") so that special characters
# do not truncate the read.
if (!exists("twitterData")) {
  con <- file("./final/en_US/en_US.twitter.txt", "rb")
  twitterData <- read_lines(con)
  close(con)
}
if (!exists("newsData")) {
  con <- file("./final/en_US/en_US.news.txt", "rb")
  newsData <- read_lines(con)
  close(con)
}
if (!exists("blogsData")) {
  con <- file("./final/en_US/en_US.blogs.txt", "rb")
  blogsData <- read_lines(con)
  close(con)
}
The data is quite large: roughly 4 million lines in total. Building the initial corpus, n-grams, and model would be time-consuming at this size, so the initial work is performed on a random 10% subset of the data.
library(dplyr)

# Randomly flag roughly 10% of the lines in each source for inclusion
set.seed(42)
include1 <- rbinom(n = length(blogsData), size = 1, prob = 0.1) == 1
include2 <- rbinom(n = length(newsData), size = 1, prob = 0.1) == 1
include3 <- rbinom(n = length(twitterData), size = 1, prob = 0.1) == 1

blogs_selected   <- tibble(source = "blogs",   text = blogsData[include1])
news_selected    <- tibble(source = "news",    text = newsData[include2])
twitter_selected <- tibble(source = "twitter", text = twitterData[include3])
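The three samples are then combined into a single data frame and cached to disk. A minimal sketch of how that cache (the sample.rds file read back in later) might be produced is shown below; the bind_rows()/write_rds() step is an assumption, since only the read side appears in this report.

# Assumed step (only the read of "./sample.rds" appears later in this report):
# combine the three sampled sources into one data frame and cache it to disk.
sample_df <- bind_rows(blogs_selected, news_selected, twitter_selected)
write_rds(sample_df, "./sample.rds")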
Here are the statistics for the full datasets versus the 10% subsets:
| Source | Number of rows | Max line length | Object size (MB) |
|---|---|---|---|
| Blogs | 899288 | 40833 | 255 |
| Blogs (subset) | 89812 | 37191 | 26 |
| News | 1010242 | 2363 | 257 |
| News (subset) | 101364 | 11384 | 27 |
| Twitter | 2360148 | 140 | 319 |
| Twitter (subset) | 236335 | 140 | 34 |
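For reference, here is a minimal sketch of how these summary statistics could be computed from the objects above; the helper function is illustrative, not the exact code used to build the table.

# Illustrative helper (not the original code): number of rows, longest line,
# and approximate in-memory size in MB for a character vector of lines.
line_stats <- function(lines) {
  c(rows = length(lines),
    max_line_length = max(nchar(lines)),
    size_mb = round(as.numeric(object.size(lines)) / 2^20))
}
line_stats(blogsData)            # full blogs file
line_stats(blogsData[include1])  # 10% blogs subset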
library(quanteda)

if (!exists("sample_df")) {
  # load the cached sample data
  # sample_df is the combined data frame of the blog, news, and twitter samples
  sample_df <- read_rds("./sample.rds")
}
# build a corpus from the sampled text
corp <- corpus(sample_df, text_field = "text")
# rename the docnames to be the source plus line number
docid <- paste(corp$source, seq_len(ndoc(corp)), sep = "_")
docnames(corp) <- docid
# corpus size in MB (object.size() keeps printing the unit label "bytes")
object.size(corp) / 2^20
## 148.8 bytes
The initial token set is built from the corpus, with punctuation, symbols, and URLs removed.
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_url = TRUE)
An initial glance at the data shows that the most common tokens are common function words such as "the", "and", "to", and "a".
library(ggplot2)
library(scales)
dfm_tokens <- dfm(toks)
top <- tibble::enframe(topfeatures(dfm_tokens, n = 20))
colnames(top) <- c("word", "n")

# bar chart of the 20 most common tokens
top %>%
  ggplot(aes(n, reorder(word, n))) +
  geom_col() +
  labs(x = "Frequency", y = "Token") +
  scale_x_continuous(labels = scales::label_number_si()) +
  theme_light()
This is consistent with a principle called Zipf's Law. Applied to NLP, it states that when the words (in this case tokens) are sorted in descending order of frequency, the frequency of the nth term is roughly inversely proportional to n. In English, "the" accounts for about 7% of word occurrences, while "of", the second most frequent word, accounts for about 3.5%. Our token set appears to follow this distribution.
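As a rough check (a sketch only, reusing the dfm_tokens object from above), Zipf's Law implies that a regression of log frequency on log rank should have a slope of roughly -1.

# Rough Zipf's Law check: log-log regression of token frequency on rank.
# A fitted slope near -1 is consistent with Zipf's Law.
freqs <- sort(colSums(dfm_tokens), decreasing = TRUE)
zipf_df <- tibble(rank = seq_along(freqs), freq = as.numeric(freqs))
coef(lm(log(freq) ~ log(rank), data = zipf_df))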
Another interesting look is a cumulative distribution chart showing
the number of tokens required to cover a certain percentage of the word
count. Here we see that 0.067% of the tokens account for 50% of the word
frequencies and 3.7% of the tokens account for 90% of the word
frequencies.
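The exact code for that chart is not reproduced here, but a minimal sketch of such a cumulative-coverage plot, reusing dfm_tokens, might look like this:

# Cumulative coverage: fraction of all word occurrences covered by the
# top x% of tokens, with tokens sorted by descending frequency.
freqs <- sort(colSums(dfm_tokens), decreasing = TRUE)
coverage <- tibble(
  token_pct = seq_along(freqs) / length(freqs),
  covered   = cumsum(as.numeric(freqs)) / sum(freqs)
)
coverage %>%
  ggplot(aes(token_pct, covered)) +
  geom_line() +
  labs(x = "Proportion of tokens", y = "Cumulative share of word frequency") +
  theme_light()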
The hunspell package was used to identify tokens that do not appear in the hunspell dictionary. Hunspell is a spell checker and morphological analyzer currently used for spell checking by programs such as Chrome, Firefox, and OpenOffice.
Running the tokens through the hunspell checker showed that 54% of the tokens were not in the dictionary. This seemed high at first glance, but a deeper analysis showed that these tokens accounted for only 3.2% of total word frequency, so they comprise a long tail of rare tokens.
library(hunspell)

# token totals sorted by descending frequency, with cumulative coverage columns
totals <- tibble::enframe(sort(colSums(dfm_tokens), decreasing = TRUE))
totals <- totals %>%
  mutate(prop = value / sum(value),
         cum_val = cumsum(value) / sum(value),
         feature_num = row_number(),
         feature_pct = feature_num / n())

# tokens that the hunspell dictionary does not recognize
not_in_dict <- totals %>% filter(!hunspell_check(toupper(name)))
nrow(not_in_dict) / nrow(totals)
## [1] 0.5488282
sum(not_in_dict$prop)
## [1] 0.03219276
These tokens were removed from the token set.
# remove the non-dictionary tokens (fixed matching, so token text is taken literally)
toks_clean <- tokens_select(toks, pattern = not_in_dict$name,
                            selection = "remove", valuetype = "fixed")
# check that every remaining token is in the dictionary
dfm_clean <- dfm(toks_clean)
table(hunspell_check(toupper(featnames(dfm_clean))))
##
## TRUE
## 103303
The next step is to construct n-grams from the token list. N-grams are sequential combinations of tokens that capture word order. For example, in the prior sentence a 5-gram would be "are sequential combinations of tokens."
The idea behind constructing n-grams is that they can be used for prediction. Given a string of four words, the 5-gram that begins with those four words and has the highest frequency gives the most probable next word; the 5-gram with the second-highest frequency gives the second most probable next word.
The n-grams are constructed from the cleaned tokens. This is a time-intensive process, so the precomputed tables are loaded from disk below (the construction code is left commented out).
# create ngrams (precomputed; uncomment to rebuild)
#toks_ngram2 <- tokens_ngrams(toks_clean, n = 2)
#toks_ngram3 <- tokens_ngrams(toks_clean, n = 3)
#toks_ngram4 <- tokens_ngrams(toks_clean, n = 4)
#toks_ngram5 <- tokens_ngrams(toks_clean, n = 5)
# build ngram frequency tables
#dfm2 <- dfm(toks_ngram2)
#dfm3 <- dfm(toks_ngram3)
#dfm4 <- dfm(toks_ngram4)
#dfm5 <- dfm(toks_ngram5)
#ngram2_tot <- tibble::enframe(sort(colSums(dfm2), TRUE))
#ngram3_tot <- tibble::enframe(sort(colSums(dfm3), TRUE))
#ngram4_tot <- tibble::enframe(sort(colSums(dfm4), TRUE))
#ngram5_tot <- tibble::enframe(sort(colSums(dfm5), TRUE))
ngram2_tot <- read_rds("./2gram.rds")
ngram3_tot <- read_rds("./3gram.rds")
ngram4_tot <- read_rds("./4gram.rds")
ngram5_tot <- read_rds("./5gram.rds")
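As an illustration of the lookup idea described above, the sketch below matches a four-word context against the 5-gram table. The helper function is an assumption rather than part of the final model; it relies on quanteda's default "_" separator for n-gram tokens and the lowercasing applied by dfm().

# Illustrative lookup (assumed helper, not the final model): return the most
# frequent 5-grams that start with the given four-word context.
predict_from_5grams <- function(context, ngram_tot = ngram5_tot, top_n = 3) {
  prefix <- paste0(paste(tolower(context), collapse = "_"), "_")
  ngram_tot %>%
    filter(startsWith(name, prefix)) %>%
    arrange(desc(value)) %>%
    slice_head(n = top_n) %>%
    mutate(next_word = sub(".*_", "", name))
}
predict_from_5grams(c("at", "the", "end", "of"))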
Looking at the 5-gram table shows an interesting phenomenon: most (98.6%) of the 5-grams occur only once in the corpus. More work needs to be done to determine whether to include or exclude these.
The following chart shows the 20 most common 5-grams.
# bar chart of the most common 5-grams
ngram5_tot[1:20, ] %>%
  ggplot(aes(value, reorder(name, value))) +
  geom_col() +
  labs(x = "Frequency", y = "ngram") +
  scale_x_continuous(labels = scales::label_number_si()) +
  theme_light()
Based on background reading, there are several candidate approaches for next-word prediction using n-grams.
An n-gram model predicts the next word in a sequence based on the previous n-1 words. It’s built by calculating the probabilities of word occurrence given the preceding group of words in the training corpus. With Laplace smoothing (also known as Add-one smoothing), one is added to every count to avoid the issue of zero probability for unseen n-grams. This method is relatively simple to implement but doesn’t perform as well with large vocabularies or infrequent n-grams because it assigns equal probabilities to all unseen n-grams.
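A hedged sketch of what add-one smoothing looks like with the frequency tables above follows; the helper function and the choice of V (the cleaned unigram vocabulary size) are illustrative assumptions, not part of the final model.

# Illustrative add-one (Laplace) smoothed probability of the last word of a
# 5-gram given its leading 4-gram: (count(w1..w5) + 1) / (count(w1..w4) + V).
laplace_prob <- function(gram5, gram4, vocab_size) {
  c5 <- ngram5_tot$value[match(gram5, ngram5_tot$name)]
  c4 <- ngram4_tot$value[match(gram4, ngram4_tot$name)]
  c5 <- ifelse(is.na(c5), 0, c5)  # unseen 5-gram
  c4 <- ifelse(is.na(c4), 0, c4)  # unseen 4-gram context
  (c5 + 1) / (c4 + vocab_size)
}
V <- nfeat(dfm_clean)  # vocabulary size of the cleaned unigram dfm
laplace_prob("at_the_end_of_the", "at_the_end_of", V)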
Back-off models provide a balance between complexity and performance. If a higher-order n-gram does not exist for the given context, this model “backs off” to a lower-order n-gram. For example, if a five-gram isn’t found in the corpus, the model backs off to a four-gram, and so on. This approach helps reduce the computational requirements compared to more complex models and performs relatively well for unseen n-grams, but it can still suffer from data sparsity issues.
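A minimal sketch of a back-off style lookup over the tables above is shown below. It is closer to a simple "stupid backoff" than a properly normalized back-off model, and everything in it is an assumption about how the final model might be structured.

# Illustrative back-off: try the 5-gram table first, then progressively
# shorter n-gram tables until a matching context is found.
backoff_predict <- function(context,
                            tables = list(ngram5_tot, ngram4_tot,
                                          ngram3_tot, ngram2_tot)) {
  context <- tolower(context)
  for (i in seq_along(tables)) {
    order_n <- 5 - (i - 1)             # n-gram order: 5, 4, 3, 2
    ctx <- tail(context, order_n - 1)  # use the last n-1 words of the context
    prefix <- paste0(paste(ctx, collapse = "_"), "_")
    hits <- tables[[i]] %>%
      filter(startsWith(name, prefix)) %>%
      arrange(desc(value))
    if (nrow(hits) > 0) {
      return(sub(".*_", "", hits$name[1:min(3, nrow(hits))]))
    }
  }
  character(0)  # no match at any n-gram order
}
backoff_predict(c("at", "the", "end", "of"))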
Kneser-Ney smoothing is considered one of the most effective techniques for next-word prediction using n-grams. It not only discounts higher-order n-gram counts but also weights these discounted probabilities with lower-order n-gram probabilities. This model improves upon back-off models by using all n-gram counts rather than just the highest available one. However, it’s more computationally expensive and complex to implement compared to other techniques. It also requires a good understanding of the underlying algorithms and a large amount of data to work effectively.
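For reference, the interpolated Kneser-Ney estimate is commonly written as follows (bigram case shown, with d a fixed discount; this is the textbook formulation, not something implemented here yet):

$$
P_{KN}(w_i \mid w_{i-1}) = \frac{\max\big(c(w_{i-1} w_i) - d,\, 0\big)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{cont}}(w_i)
$$

$$
\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,\big|\{\, w : c(w_{i-1} w) > 0 \,\}\big|,
\qquad
P_{\text{cont}}(w_i) = \frac{\big|\{\, w' : c(w' w_i) > 0 \,\}\big|}{\big|\{\, (w', w'') : c(w' w'') > 0 \,\}\big|}
$$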
The next step is to code each of these models and stop when a suitable accuracy is achieved. The target accuracy has not been determined yet and will likely be measured by the ability to predict a set of common phrases.