Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that focuses on the interaction between computers and human language. It encompasses the methods and techniques used to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful. NLP plays a crucial role in bridging the gap between human communication and computer systems, enabling machines to process and analyze vast amounts of textual data.
One of the key applications of NLP is text prediction, where algorithms generate probable next words or phrases based on the context of the input text. Text prediction relies on NLP techniques such as language modeling, statistical analysis, and machine learning. By training models on large amounts of text data, NLP systems can accurately predict the most likely next words or phrases, leading to improved writing assistance and productivity tools.
Text prediction has a wide range of use cases across different domains. In the context of email communication, NLP-based text prediction can suggest relevant phrases, complete sentences, or even compose entire email responses, saving time and effort for users. In addition, NLP-powered text prediction is widely used in search engines and autocomplete features, providing users with suggested search queries or completing their search queries based on historical data and patterns.
Text prediction is also valuable in the context of writing and content creation. Content creators, including authors, journalists, and bloggers, can benefit from NLP-powered text prediction to generate suggestions for the next word, sentence, or paragraph, aiding in the creative process and improving overall writing efficiency. Moreover, in the field of customer support, chatbots and virtual assistants employ text prediction to generate automated responses that closely match customer queries, enhancing customer service and reducing response time.
This report details the building of a next-word text prediction model using source data provided by the company SwiftKey. The source data comprises text from news sites, blogs, and Twitter posts.
The model will be built using NLP techniques in R, following the steps outlined in the sections below.
This work will be completed within the next seven weeks.
Data is loaded from the US English versions of the files.
library(readr)

# Load each source file only if it is not already in memory.
# The files are opened in binary mode ("rb") so that special characters
# do not truncate the read.
if (!exists("twitterData")) {
  con <- file("./final/en_US/en_US.twitter.txt", "rb")
  twitterData <- read_lines(con)
  close(con)
}
if (!exists("newsData")) {
  con <- file("./final/en_US/en_US.news.txt", "rb")
  newsData <- read_lines(con)
  close(con)
}
if (!exists("blogsData")) {
  con <- file("./final/en_US/en_US.blogs.txt", "rb")
  blogsData <- read_lines(con)
  close(con)
}
The data is quite large: roughly 4 million lines in total. Building the initial corpus, n-grams, and model would be time-consuming at this size, so the initial work is performed on a random 10% subset of the data.
library(dplyr)

# Randomly flag roughly 10% of the lines in each source for inclusion
set.seed(42)
include1 <- rbinom(n = length(blogsData), size = 1, prob = 0.1) == 1
include2 <- rbinom(n = length(newsData), size = 1, prob = 0.1) == 1
include3 <- rbinom(n = length(twitterData), size = 1, prob = 0.1) == 1

blogs_selected   <- tibble(source = "blogs",   text = blogsData[include1])
news_selected    <- tibble(source = "news",    text = newsData[include2])
twitter_selected <- tibble(source = "twitter", text = twitterData[include3])
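The three samples are then combined into a single data frame and cached to disk. A minimal sketch of how that cache (the sample.rds file read back in later) might be produced is shown below; the bind_rows()/write_rds() step is an assumption, since only the read side appears in this report.

# Assumed step (only the read of "./sample.rds" appears later in this report):
# combine the three sampled sources into one data frame and cache it to disk.
sample_df <- bind_rows(blogs_selected, news_selected, twitter_selected)
write_rds(sample_df, "./sample.rds")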
Here are the statistics for the full datasets versus the 10% subsets:
| Source | Number of rows | Max line length | Object size (MB) |
|---|---|---|---|
| Blogs | 899288 | 40833 | 255 |
| Blogs (subset) | 89812 | 37191 | 26 |
| News | 1010242 | 2363 | 257 |
| News (subset) | 101364 | 11384 | 27 |
| Twitter | 2360148 | 140 | 319 |
| Twitter (subset) | 236335 | 140 | 34 |
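For reference, here is a minimal sketch of how these summary statistics could be computed from the objects above; the helper function is illustrative, not the exact code used to build the table.

# Illustrative helper (not the original code): number of rows, longest line,
# and approximate in-memory size in MB for a character vector of lines.
line_stats <- function(lines) {
  c(rows = length(lines),
    max_line_length = max(nchar(lines)),
    size_mb = round(as.numeric(object.size(lines)) / 2^20))
}
line_stats(blogsData)            # full blogs file
line_stats(blogsData[include1])  # 10% blogs subset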
library(quanteda)

if (!exists("sample_df")) {
  # load the cached sample data
  # sample_df is the combined data frame of the blog, news, and twitter samples
  sample_df <- read_rds("./sample.rds")
}
# build a corpus from the sampled text
corp <- corpus(sample_df, text_field = "text")
# rename the docnames to be the source plus line number
docid <- paste(corp$source, seq_len(ndoc(corp)), sep = "_")
docnames(corp) <- docid
# corpus size in MB (object.size() keeps printing the unit label "bytes")
object.size(corp) / 2^20
## 148.8 bytes
The initial token set is built from the corpus, with punctuation, symbols, and URLs removed.
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_url = TRUE)
An initial glance at the data shows that the most common tokens are common function words such as "the", "and", "to", and "a".
library(ggplot2)
library(scales)
dfm_tokens <- dfm(toks)
top <- tibble::enframe(topfeatures(dfm_tokens, n = 20))
colnames(top) <- c("word", "n")

# bar chart of the 20 most common tokens
top %>%
  ggplot(aes(n, reorder(word, n))) +
  geom_col() +
  labs(x = "Frequency", y = "Token") +
  scale_x_continuous(labels = scales::label_number_si()) +
  theme_light()
This is consistent with a principle called Zipf's Law. Applied to NLP, it states that when the words (in this case tokens) are sorted in descending order of frequency, the frequency of the nth term is roughly inversely proportional to n. In English, "the" accounts for about 7% of word occurrences, while "of", the second most frequent word, accounts for about 3.5%. Our token set appears to follow this distribution.
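As a rough check (a sketch only, reusing the dfm_tokens object from above), Zipf's Law implies that a regression of log frequency on log rank should have a slope of roughly -1.

# Rough Zipf's Law check: log-log regression of token frequency on rank.
# A fitted slope near -1 is consistent with Zipf's Law.
freqs <- sort(colSums(dfm_tokens), decreasing = TRUE)
zipf_df <- tibble(rank = seq_along(freqs), freq = as.numeric(freqs))
coef(lm(log(freq) ~ log(rank), data = zipf_df))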
Another interesting look is a cumulative distribution chart showing
the number of tokens required to cover a certain percentage of the word
count. Here we see that 0.067% of the tokens account for 50% of the word
frequencies and 3.7% of the tokens account for 90% of the word
frequencies.
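The exact code for that chart is not reproduced here, but a minimal sketch of such a cumulative-coverage plot, reusing dfm_tokens, might look like this:

# Cumulative coverage: fraction of all word occurrences covered by the
# top x% of tokens, with tokens sorted by descending frequency.
freqs <- sort(colSums(dfm_tokens), decreasing = TRUE)
coverage <- tibble(
  token_pct = seq_along(freqs) / length(freqs),
  covered   = cumsum(as.numeric(freqs)) / sum(freqs)
)
coverage %>%
  ggplot(aes(token_pct, covered)) +
  geom_line() +
  labs(x = "Proportion of tokens", y = "Cumulative share of word frequency") +
  theme_light()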
The hunspell package was used to identify tokens that do not appear in the hunspell dictionary. Hunspell is a spell checker and morphological analyzer currently used for spell checking by programs such as Chrome, Firefox, and OpenOffice.
Running the tokens through the hunspell checker showed that 54% of the tokens were not in the dictionary. This seemed high at first glance, but a deeper analysis showed that these tokens accounted for only 3.2% of total word frequency, so they comprise a long tail of rare tokens.
library(hunspell)

# token totals sorted by descending frequency, with cumulative coverage columns
totals <- tibble::enframe(sort(colSums(dfm_tokens), decreasing = TRUE))
totals <- totals %>%
  mutate(prop = value / sum(value),
         cum_val = cumsum(value) / sum(value),
         feature_num = row_number(),
         feature_pct = feature_num / n())

# tokens that the hunspell dictionary does not recognize
not_in_dict <- totals %>% filter(!hunspell_check(toupper(name)))
nrow(not_in_dict) / nrow(totals)
## [1] 0.5488282
sum(not_in_dict$prop)
## [1] 0.03219276
These tokens were removed from the token set.
# remove the non-dictionary tokens (fixed matching, so token text is taken literally)
toks_clean <- tokens_select(toks, pattern = not_in_dict$name,
                            selection = "remove", valuetype = "fixed")
# check that every remaining token is in the dictionary
dfm_clean <- dfm(toks_clean)
table(hunspell_check(toupper(featnames(dfm_clean))))
##
## TRUE
## 103303
The next step is to construct n-grams from the token list. N-grams are sequential combinations of tokens that capture word order. For example, in the prior sentence a 5-gram would be "are sequential combinations of tokens."
The idea behind constructing n-grams is that they can be used for prediction. Given a string of four words, the 5-gram that begins with those four words and has the highest frequency gives the most probable next word; the 5-gram with the second-highest frequency gives the second most probable next word.
The n-grams are constructed from the cleaned tokens. This is a time-intensive process, so the precomputed tables are loaded from disk below (the construction code is left commented out).
# create ngrams (precomputed; uncomment to rebuild)
#toks_ngram2 <- tokens_ngrams(toks_clean, n = 2)
#toks_ngram3 <- tokens_ngrams(toks_clean, n = 3)
#toks_ngram4 <- tokens_ngrams(toks_clean, n = 4)
#toks_ngram5 <- tokens_ngrams(toks_clean, n = 5)
# build ngram frequency tables
#dfm2 <- dfm(toks_ngram2)
#dfm3 <- dfm(toks_ngram3)
#dfm4 <- dfm(toks_ngram4)
#dfm5 <- dfm(toks_ngram5)
#ngram2_tot <- tibble::enframe(sort(colSums(dfm2), TRUE))
#ngram3_tot <- tibble::enframe(sort(colSums(dfm3), TRUE))
#ngram4_tot <- tibble::enframe(sort(colSums(dfm4), TRUE))
#ngram5_tot <- tibble::enframe(sort(colSums(dfm5), TRUE))
ngram2_tot <- read_rds("./2gram.rds")
ngram3_tot <- read_rds("./3gram.rds")
ngram4_tot <- read_rds("./4gram.rds")
ngram5_tot <- read_rds("./5gram.rds")
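As an illustration of the lookup idea described above, the sketch below matches a four-word context against the 5-gram table. The helper function is an assumption rather than part of the final model; it relies on quanteda's default "_" separator for n-gram tokens and the lowercasing applied by dfm().

# Illustrative lookup (assumed helper, not the final model): return the most
# frequent 5-grams that start with the given four-word context.
predict_from_5grams <- function(context, ngram_tot = ngram5_tot, top_n = 3) {
  prefix <- paste0(paste(tolower(context), collapse = "_"), "_")
  ngram_tot %>%
    filter(startsWith(name, prefix)) %>%
    arrange(desc(value)) %>%
    slice_head(n = top_n) %>%
    mutate(next_word = sub(".*_", "", name))
}
predict_from_5grams(c("at", "the", "end", "of"))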
Looking at the 5-gram table shows an interesting phenomenon: most (98.6%) of the 5-grams occur only once in the corpus. More work needs to be done to determine whether to include or exclude these.
The following chart shows the 20 most common 5-grams.
# bar chart of the most common 5-grams
ngram5_tot[1:20, ] %>%
  ggplot(aes(value, reorder(name, value))) +
  geom_col() +
  labs(x = "Frequency", y = "ngram") +
  scale_x_continuous(labels = scales::label_number_si()) +
  theme_light()
Based on background reading, there are several candidate approaches for next-word prediction using n-grams.
An n-gram model predicts the next word in a sequence based on the previous n-1 words. It’s built by calculating the probabilities of word occurrence given the preceding group of words in the training corpus. With Laplace smoothing (also known as Add-one smoothing), one is added to every count to avoid the issue of zero probability for unseen n-grams. This method is relatively simple to implement but doesn’t perform as well with large vocabularies or infrequent n-grams because it assigns equal probabilities to all unseen n-grams.
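A hedged sketch of what add-one smoothing looks like with the frequency tables above follows; the helper function and the choice of V (the cleaned unigram vocabulary size) are illustrative assumptions, not part of the final model.

# Illustrative add-one (Laplace) smoothed probability of the last word of a
# 5-gram given its leading 4-gram: (count(w1..w5) + 1) / (count(w1..w4) + V).
laplace_prob <- function(gram5, gram4, vocab_size) {
  c5 <- ngram5_tot$value[match(gram5, ngram5_tot$name)]
  c4 <- ngram4_tot$value[match(gram4, ngram4_tot$name)]
  c5 <- ifelse(is.na(c5), 0, c5)  # unseen 5-gram
  c4 <- ifelse(is.na(c4), 0, c4)  # unseen 4-gram context
  (c5 + 1) / (c4 + vocab_size)
}
V <- nfeat(dfm_clean)  # vocabulary size of the cleaned unigram dfm
laplace_prob("at_the_end_of_the", "at_the_end_of", V)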
Back-off models provide a balance between complexity and performance. If a higher-order n-gram does not exist for the given context, this model “backs off” to a lower-order n-gram. For example, if a five-gram isn’t found in the corpus, the model backs off to a four-gram, and so on. This approach helps reduce the computational requirements compared to more complex models and performs relatively well for unseen n-grams, but it can still suffer from data sparsity issues.
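A minimal sketch of a back-off style lookup over the tables above is shown below. It is closer to a simple "stupid backoff" than a properly normalized back-off model, and everything in it is an assumption about how the final model might be structured.

# Illustrative back-off: try the 5-gram table first, then progressively
# shorter n-gram tables until a matching context is found.
backoff_predict <- function(context,
                            tables = list(ngram5_tot, ngram4_tot,
                                          ngram3_tot, ngram2_tot)) {
  context <- tolower(context)
  for (i in seq_along(tables)) {
    order_n <- 5 - (i - 1)             # n-gram order: 5, 4, 3, 2
    ctx <- tail(context, order_n - 1)  # use the last n-1 words of the context
    prefix <- paste0(paste(ctx, collapse = "_"), "_")
    hits <- tables[[i]] %>%
      filter(startsWith(name, prefix)) %>%
      arrange(desc(value))
    if (nrow(hits) > 0) {
      return(sub(".*_", "", hits$name[1:min(3, nrow(hits))]))
    }
  }
  character(0)  # no match at any n-gram order
}
backoff_predict(c("at", "the", "end", "of"))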
Kneser-Ney smoothing is considered one of the most effective techniques for next-word prediction using n-grams. It not only discounts higher-order n-gram counts but also weights these discounted probabilities with lower-order n-gram probabilities. This model improves upon back-off models by using all n-gram counts rather than just the highest available one. However, it’s more computationally expensive and complex to implement compared to other techniques. It also requires a good understanding of the underlying algorithms and a large amount of data to work effectively.
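For reference, the interpolated Kneser-Ney estimate is commonly written as follows (bigram case shown, with d a fixed discount; this is the textbook formulation, not something implemented here yet):

$$
P_{KN}(w_i \mid w_{i-1}) = \frac{\max\big(c(w_{i-1} w_i) - d,\, 0\big)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{cont}}(w_i)
$$

$$
\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,\big|\{\, w : c(w_{i-1} w) > 0 \,\}\big|,
\qquad
P_{\text{cont}}(w_i) = \frac{\big|\{\, w' : c(w' w_i) > 0 \,\}\big|}{\big|\{\, (w', w'') : c(w' w'') > 0 \,\}\big|}
$$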
The next step is to code each of these models and stop when a suitable accuracy is achieved. The target accuracy has not been determined yet and will likely be measured by the ability to predict a set of common phrases.