Natural language processing is the field of computer science concerned with analyzing a “language that has evolved naturally in humans through use or repetition…” (6). Spoken and written language have been used throughout human history as the main modes of conveying meaning between individuals. These languages are intimately tied to human culture and might be considered the earliest form of recorded data.
Initial efforts in natural language processing focused on “hard-coding” linguistic rules into programs (e.g., if/then decision trees); however, this approach was ineffective when dealing with the many exceptions and nuances contained within language. Machine learning revolutionized the field with its use of statistical inference and models that assign probabilities to an array of linguistic decisions. This approach produces more robust algorithms that can handle novelty and automatically learn linguistic rules (6).
The goal of this project is to develop a machine learning algorithm that predicts the next word in a sentence. Engines like this already exist today, such as the word suggestions that appear above the iOS keyboard and update as the user composes a message, or the suggested Google search terms that appear in a drop-down list as the user types.
A large corpus of English text (over 4.2 million lines) was mined from three online sources: blog posts, digital news articles, and Twitter feeds. A subset of this corpus was randomly sampled (due to memory constraints), processed for statistical analysis, and explored in preparation for modeling.
This prediction algorithm will be used in the context of casual communications such as texting, emails, and social media posts; therefore, it is assumed that the resulting corpus is a representative sample of the English language used in such casual contexts.
The corpus was downloaded from a source named “HC Corpora” (download zip). It contains English, Russian, Finnish, and German texts; however, this report only focuses on English.
Also, a list of 722 English profanity words was downloaded from “Front Gate Media” (1). This allowed profanity to be filtered out of the corpus so that the project results in a clean prediction engine.
The English corpus contains three documents of text mined from blog posts, digital news articles, and Twitter feeds. Each was read into R separately for evaluation.
The data table below shows that the entire corpus is composed of over 4.2 million lines of English text, totaling over 100 million words.
| Documents | Lines (million) | Words (million) |
|---|---|---|
| Blog | 0.90 | 39.12 |
| News | 1.01 | 36.72 |
| Twitter | 2.36 | 32.79 |
| Total | 4.27 | 108.63 |
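A minimal sketch of how these line and word totals could be reproduced in R is shown below. The file names follow the common HC Corpora naming convention and are assumed here, not taken from the report itself.

```r
# Sketch: read each document and report line/word totals in millions.
# File names are assumed (HC Corpora convention), not confirmed by the report.
files <- c(Blog    = "en_US.blogs.txt",
           News    = "en_US.news.txt",
           Twitter = "en_US.twitter.txt")

summarize_doc <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- sum(lengths(strsplit(lines, "\\s+")))
  c(Lines = length(lines) / 1e6, Words = words / 1e6)
}

corpus_summary <- t(sapply(files, summarize_doc))
corpus_summary <- rbind(corpus_summary, Total = colSums(corpus_summary))
round(corpus_summary, 2)
```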
The prediction algorithm will be trained on all three documents to obtain a representative sample of the English language in a digital context. The three documents are randomly sampled and merged into a single character vector. For the sake of processing speed and to prevent memory issues, the sample draws from only 60% of the entire corpus.
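A minimal sketch of this sampling step, assuming the raw lines from each document are held in character vectors `blog`, `news`, and `twitter` (illustrative names), with an illustrative seed:

```r
# Sketch: draw a 60% random sample from each document and merge the results.
set.seed(1234)  # illustrative seed, not the one used in the report
sample_lines <- function(doc, rate = 0.60) {
  doc[as.logical(rbinom(length(doc), size = 1, prob = rate))]
}
corpus_sample <- c(sample_lines(blog), sample_lines(news), sample_lines(twitter))
```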
From here, the corpus is subjected to a series of processing steps in order to prepare it for quantification and other subsequent statistical analyses.
The processing steps, sketched in code after the examples below, include:

- Remove all punctuation
- Apply a profanity filter (a custom R script that scans the corpus and removes words contained in a list of common English profanity)
- Convert all letters to lower-case
- Remove all numeric characters
- Remove extra white space (double spaces)
| Before Processing | After Processing |
|---|---|
| All reports suggest they’ll do whatever they can to nab this hard-hitting safety and they should. Barron is an absolute ball hawk and a menace once he gets into the backfield. Defensive coordinator Rob Ryan will have a field day with his abilities and he can be much more flexible inside a complicated scheme. | all reports suggest theyll do whatever they can to nab this hardhitting safety and they should barron is an absolute ball hawk and a menace once he gets into the backfield defensive coordinator rob ryan will have a field day with his abilities and he can be much more flexible inside a complicated scheme |
| Apart of me is gone once you left me. </3 | apart of me is gone once you left me |
| After the Womack Brothers failed to set the world on fire with their gospel single, they rechristened themselves the Valentinos in 1962 and reworked the gospel standard “I Couldn’t Hear Nobody Pray,” complete with new lyrics and a new title: “Lookin’ for a Love.” | after the womack brothers failed to set the world on fire with their gospel single they rechristened themselves the valentinos in and reworked the gospel standard i couldnt hear nobody pray complete with new lyrics and a new title lookin for a love |
| “: We’re back in Nashville and ready to go to work. #wingsin7 #believe” | were back in nashville and ready to go to work wingsin believe |
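A condensed sketch of these processing steps, assuming the sampled lines are in `corpus_sample` (from the sampling sketch above) and the Front Gate Media word list is in the character vector `profanity` (both names are illustrative):

```r
# Sketch of the five processing steps, applied in the order listed above.
clean_corpus <- function(x, profanity) {
  x <- gsub("[[:punct:]]", "", x)                 # remove all punctuation
  bad <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
  x <- gsub(bad, "", x, ignore.case = TRUE)       # profanity filter (assumes plain words, no regex metacharacters)
  x <- tolower(x)                                 # convert all letters to lower-case
  x <- gsub("[[:digit:]]", "", x)                 # remove all numeric characters
  x <- gsub("\\s+", " ", x)                       # collapse extra white space
  trimws(x)
}
corpus_sample <- clean_corpus(corpus_sample, profanity)
```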
This processing converted the data into a format that can be quantified, explored, and analyzed. At this point the data exists as character strings, each composed of a variable number of words.
The first step in NLP analysis is to break the entire corpus down into individual words (i.e., to “tokenize” it). Frequency counts can then be generated for each word in the corpus.
Building on this concept, the corpus can also be broken into its 2-word sequences (bigrams), 3-word sequences (trigrams), and 4-word sequences (quadgrams). These are known as n-grams, another fundamental concept in statistical NLP. Again, frequency counts can be generated for these n-grams to determine how often particular word sequences occur in the corpus.
Note: One can start to see how this relates to the overall objective of this NLP engine. We want to predict the next word in a sentence, so it will be helpful to know how often a certain word follows another word or phrase.
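The sketch below shows one way the n-gram frequency tables could be built from the cleaned sample. Object names are illustrative; the report's own tables store each n-gram split into word columns (w.3, w.2, w.1, w.i), as seen in the modeling code later on.

```r
library(data.table)

# Sketch: tokenize the cleaned corpus and count n-gram frequencies.
tokens <- strsplit(corpus_sample, "\\s+")

ngram_counts <- function(tokens, n) {
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  data.table(gram = grams)[, .(count = .N), by = gram][order(-count)]
}

unigrams  <- ngram_counts(tokens, 1)
bigrams   <- ngram_counts(tokens, 2)
trigrams  <- ngram_counts(tokens, 3)
quadgrams <- ngram_counts(tokens, 4)
```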
The above steps in the exploratory data analysis (EDA) generate corpus features in the form of unigrams, bigrams, trigrams, and quadgrams.
The four bar charts below show the top 10 most frequently occurring unigrams, bigrams, trigrams, and quadgrams, respectively. These words and phrases make intuitive sense as the most frequently occurring, a finding further validated by the fact that they all appear in the “stopword” lists found in text-cleaning packages. Some NLP applications benefit from the removal of stopwords (frequently occurring words that convey little meaning), but not this one: the engine needs the ability to suggest stopwords wherever appropriate within a partial sentence.
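The chart data itself can be pulled straight from the frequency tables; a sketch using the unigram table from the tokenization sketch above (the same call applies to the bigram, trigram, and quadgram tables):

```r
# Sketch: the top 10 most frequent unigrams, plotted as a bar chart.
top10 <- head(unigrams, 10)   # table is already sorted by descending count
barplot(top10$count, names.arg = top10$gram, las = 2,
        main = "Top 10 unigrams", ylab = "Frequency")
```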
The following section describes the theory and development of the implemented natural language model.
Currently, the n-gram frequencies are based on how many times they occur in the training corpus; therefore, features that do not occur in this corpus are assigned zero probability. This is a problem because a corpus cannot contain all possible words and n-grams in the English language (2).
Good (3) estimates the probability of these unseen events as: \[ P_{unseen} = \frac{N_1}{N} \]
Where \(N_1\) equals the number of n-grams that occur once in the corpus and \(N\) is the total number of n-grams observed.
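As a sketch, this estimate can be read directly off one of the frequency tables built above (quadgrams shown; the object name comes from the earlier sketch, not the report's own code):

```r
# Sketch: Good's estimate of the probability mass held by unseen quadgrams.
N.1      <- quadgrams[count == 1, .N]   # quadgrams observed exactly once
N        <- quadgrams[, sum(count)]     # total quadgrams observed
p.unseen <- N.1 / N
```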
Data sparsity like this requires that we move some of the current probability mass to account for these “unseen” observations (2). The equation below shows the Good-Turing estimator used in calculating the adjusted count (\(c^*\)).
\[ c^* = (c + 1)\frac{N_{c+1}}{N_c} \]
Where \(c\) is the current unadjusted count and \(N_c\) is the number of n-grams observed \(c\) times in the corpus.
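A short sketch of the adjusted-count calculation, again using the hypothetical quadgram frequency table from the earlier sketch:

```r
# Sketch: Good-Turing adjusted counts c* for each observed frequency c.
freq.of.freq <- quadgrams[, .(N.c = .N), by = count][order(count)]  # N_c
freq.of.freq[, N.c1 := shift(N.c, type = "lead")]                   # N_{c+1}
freq.of.freq[, c.star := (count + 1) * N.c1 / N.c]
# Large counts where N_{c+1} is missing are typically left unadjusted.
```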
Kneser-Ney smoothing combines the probabilities of higher-order and lower-order language models (7). This recursion is achieved by distributing the discounted probability mass onto the lower-order models.
In the modified form, the discounting value varies with the frequency count of the n-gram (7):
\[ D_1 = 1 - 2D\frac{N_2}{N_1} \]
\[ D_2 = 2 - 3D\frac{N_3}{N_2} \]
\[ D_{3+} = 3 - 4D\frac{N_4}{N_3} \]
where \( D = \frac{N_1}{N_1 + 2N_2} \)
The weight given to these lower-order models depends on continuation probabilities, which assess the number of contexts in which a word can appear. The bigram “San Francisco” might have a relatively high count, but the unigram “Francisco” would be assigned a low continuation probability since it may only ever appear after the word “San” (5).
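A sketch of a continuation count, assuming a bigram table (here called `dt.2`) that stores the preceding word in `w.1` and the completing word in `w.i`, mirroring the column convention of the quadgram table used below:

```r
# Sketch: in how many distinct contexts does each word complete a bigram?
cont.counts <- dt.2[, .(n.contexts = uniqueN(w.1)), by = w.i]
# "francisco" receives a low n.contexts value if it only ever follows "san",
# regardless of how large the raw count of the bigram "san francisco" is.
```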
Below is the fourth-order Kneser-Ney equation, implemented using a data table containing quadgram phrases and their corresponding frequency counts (4):
\[ P_{KN}(w_i|w^{i-1}_{i-3}) = \frac{c(w_{i-3}^i ) - D}{c(w_{i-3}^{i-1})} + \frac{D * N_{1+}(w_{i-3}^{i-1}\bullet)}{c(w_{i-3}^{i-1})} * P_3 \]
\[ P_3 = \frac{N_{1+}(\bullet w_{i-2}^i) - D}{N_{1+}(\bullet w_{i-2}^{i-1}\bullet)} + \frac{D * N_{1+}(w_{i-2}^{i-1}\bullet)}{N_{1+}(\bullet w_{i-2}^{i-1} \bullet)} * P_2 \]
\[ P_2 = \frac{N_{1+}(\bullet w_{i-1}^i) - D}{N_{1+}(\bullet w_{i-1}\bullet)} + \frac{D * N_{1+}(w_{i-1}\bullet)}{N_{1+}(\bullet w_{i-1} \bullet)} * P_1 \]
\[ P_1 = \frac{N_{1+}(\bullet w_i)}{N_{1+}(\bullet \bullet)} \]
The code below illustrates the Modified Kneser-Ney model implemented in R:
```r
#1. Apply modified discounting to all quadgrams based on the number of times each quadgram is observed in the corpus
aD <- (nrow(dt.4[count == 1])) / (nrow(dt.4[count == 1]) +
2*(nrow(dt.4[count == 2])))
dt.4[,D := 0]
dt.4[count==1, "D"] <- 1 - 2*aD*(nrow(dt.4[count == 2])/nrow(dt.4[count == 1]))
dt.4[count==2, "D"] <- 2 - 3*aD*(nrow(dt.4[count == 3])/nrow(dt.4[count == 2]))
dt.4[count > 2, "D"] <- 3 - 4*aD*(nrow(dt.4[count == 4])/nrow(dt.4[count == 3]))
dt.4[, c.adj := count - D]

#2. Calculate all variables in the Modified Kneser-Ney algorithm
#Calculating the count of each quadgram's preceding three-word history (the denominator)
dt.4 <- dt.4[,.(w.i, count, D, c.adj, quad.Denom = sum(count)), by= list(w.3, w.2, w.1)]
#Calculating continuation counts for quad, tri, and bigram
dt.4 <- dt.4[,.(w.i, count, D, c.adj, quad.Denom, quad.Cont = .N), by= list(w.3, w.2, w.1)]
dt.4 <- dt.4[,.(w.3, w.i, count, D, c.adj, quad.Denom, quad.Cont, tri.Cont = .N), by= list(w.2, w.1)]
dt.4 <- dt.4[,.(w.3, w.2, w.i, count, D, c.adj, quad.Denom, quad.Cont, tri.Cont, bi.Cont = .N), by= list(w.1)]
#Calculate lambda 1
dt.4 <- dt.4[,.(w.3, w.2, w.1, w.i, count, D, c.adj, tri.Cont, bi.Cont, quad.Denom, l1 = (D*quad.Cont)/(quad.Denom))]
#Third-Order KN continuation variables
dt.4 <- dt.4[,.(w.3, count, D, c.adj, tri.Cont, bi.Cont, l1, quad.Denom, N.3 = (.N - D)), by= list(w.2, w.1, w.i)]
dt.4 <- dt.4[,.(w.3, w.i, count, D, c.adj, tri.Cont, bi.Cont, l1, quad.Denom, N.3, D.3 = .N), by= list(w.2, w.1)]
dt.4 <- dt.4[,.(w.3, w.2, w.1, w.i, count, D, c.adj, tri.Cont, bi.Cont, quad.Denom, l1, D.3, N.3, p.Cont.3 = (N.3/D.3))]
#Calculate lambda 2
dt.4 <- dt.4[,.(w.3, w.2, w.1, w.i, count, D, c.adj, bi.Cont, p.Cont.3,quad.Denom, l1, l2 = (D*tri.Cont)/(D.3))]
#Second-Order KN continuation variables
dt.4 <- dt.4[,.(w.3, w.2, count, D, c.adj, bi.Cont, l1, l2,quad.Denom, p.Cont.3, N.2 = (.N - D)), by= list(w.1, w.i)]
dt.4 <- dt.4[,.(w.3, w.2, w.i, count, D, c.adj, bi.Cont, l1, l2, quad.Denom, p.Cont.3, N.2, D.2 = .N), by= list(w.1)]
dt.4 <- dt.4[,.(w.3, w.2, w.1, w.i, count, D, c.adj, bi.Cont,quad.Denom, l1, l2, D.2, p.Cont.3, p.Cont.2 = (N.2/D.2))]
#Calculate lambda 3
dt.4 <- dt.4[,.(w.3, w.2, w.1, w.i, count, D, c.adj, quad.Denom, bi.Cont, p.Cont.3, l1, l2, p.Cont.2, l3 = (D*bi.Cont)/(D.2))]
length <- length(unique(dt.4[,w.i]))
#Calculating lowest-order, unigram continuation probability
dt.4 <- dt.4[,.(w.3, w.2, w.1, count, D, c.adj, quad.Denom, p.Cont.3, l1, l2, p.Cont.2, l3, p.Cont.1 = .N/length), by= w.i]

#3. Finally, calculate the final MKN probabilities for each quadgram and subset the data table to minimize occupied disk space
dt.4 <- dt.4[, .(w.3, w.2, w.1, w.i, l2, l1, quad.Denom, c.adj, p.Cont.3, p2 = p.Cont.2 + l3*p.Cont.1)]
dt.4 <- dt.4[, .(w.3, w.2, w.1, w.i, c.adj, quad.Denom, l1, p3 = p.Cont.3 + l2*p2)]
dt.4 <- dt.4[, .(w.3, w.2, w.1, w.i, p.f = (c.adj/quad.Denom) + l1*p3)]
```

This process was repeated on the lower-order data tables to calculate the MKN probabilities of trigrams, bigrams, and unigrams.
A Shiny application was developed to allow for real-time user interaction and sentence prediction.
Fourth-order, third-order, second-order, and first-order data tables are all loaded into the application to allow for recursive modeling.
Fourth-order models are used when the user enters three or more words. The last three words of the input are used to subset the data tables and return the ten highest-probability grams. The last word is extracted from each gram and returned as the word prediction for that gram. If fewer than ten grams are returned, the fourth-order predictions are preserved and the model backs off to third-order predictions. This recursive process continues down to the unigram vocabulary probabilities until ten predictions have been identified.
Note: Modeling begins with the lower-order trigram or bigram tables if the user enters fewer than three words.
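A simplified sketch of this back-off look-up is given below. It assumes the quadgram table `dt.4` (columns `w.3`, `w.2`, `w.1`, `w.i`, `p.f`) built earlier, with analogous lower-order tables; only the fourth-order step is written out.

```r
library(data.table)

# Sketch: return up to n word predictions for a partial sentence.
predict_words <- function(input, n = 10) {
  words <- tail(strsplit(tolower(input), "\\s+")[[1]], 3)
  preds <- data.table(w.i = character(0), p.f = numeric(0))

  if (length(words) >= 3) {
    # Fourth-order predictions: subset on the last three words of the input
    hit   <- dt.4[w.3 == words[1] & w.2 == words[2] & w.1 == words[3], .(w.i, p.f)]
    preds <- rbind(preds, head(hit[order(-p.f)], n))
  }
  # ...analogous look-ups against the trigram, bigram, and unigram tables
  # continue until n unique predictions have been collected.

  # Predictions repeated across orders accumulate their probabilities
  head(preds[, .(p.f = sum(p.f)), by = w.i][order(-p.f)], n)
}
```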
The main dashboard features one user input and two outputs.
The user inputs a partial sentence in a field located in the upper left-hand corner, titled “Sentence Input”. This generates a character vector with which data tables are subset.
The entire second row features a bar chart displaying the top 10 word predictions based on the partial sentence input by the user. The y-axis shows relative probabilities, and the column colors indicate the order of the model that generated each prediction.
Note 1: There are not always 10 unique words. When the model backs off to lower orders, the lower order predictions will often generate some of the same predictions made by the higher order models. In this case, the probabilities made by higher and lower-order models add together and present a cumulative, relative probability.
Note 2: Upon application initialization, the graph features the unigram vocabulary probabilities (top 10 most common words in the corpus) as no words have been input by the user.
The last output is featured in the upper right-hand corner of the dashboard. It displays the partial sentence input by the user plus the highest-probability prediction. In other words, the sentence is updated in real time with the best prediction.
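A minimal sketch of this dashboard layout is shown below; the widget names are illustrative, and `predict_words()` refers to the hypothetical back-off function sketched earlier, not the deployed application's code.

```r
library(shiny)

ui <- fluidPage(
  fluidRow(
    column(6, textInput("sentence", "Sentence Input", value = "")),
    column(6, textOutput("completed"))        # input sentence + best prediction
  ),
  fluidRow(column(12, plotOutput("topTen")))  # top-10 prediction bar chart
)

server <- function(input, output) {
  preds <- reactive(predict_words(input$sentence))
  output$completed <- renderText(paste(input$sentence, preds()$w.i[1]))
  output$topTen <- renderPlot({
    p <- preds()
    if (nrow(p) > 0)
      barplot(p$p.f, names.arg = p$w.i, las = 2, ylab = "Relative probability")
  })
}

shinyApp(ui, server)
```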
1. A List Of 723 Bad Words To Blacklist & How To Use Facebook’s Moderation Tool. (2014, May 12). Retrieved June 01, 2018, from https://www.frontgatemedia.com/a-list-of-723-bad-words-to-blacklist-and-how-to-use-facebooks-moderation-tool/
2. Chen, S. F., & Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4), 359-393. doi:10.1006/csla.1999.0128
3. Good, I. J. (1953). The Population Frequencies of Species and the Estimation of Population Parameters. Biometrika, 40(3/4), 237. doi:10.2307/2333344
4. Körner, M. C. (2013). Implementation of Modified Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction. Institute for Web Science and Technologies. Retrieved June 01, 2018, from https://west.uni-koblenz.de/sites/default/files/BachelorArbeit_MartinKoerner.pdf
5. Milli, S. (2015, June 30). Kneser-Ney Smoothing. Retrieved from http://smithamilli.com/blog/kneser-ney/
6. Natural language. (2018, July 10). Retrieved from https://en.wikipedia.org/wiki/Natural_language
7. Ney, H., Essen, U., & Kneser, R. (1994). On structuring probabilistic dependences in stochastic language modelling. Computer Speech & Language, 8(1), 1-38. doi:10.1006/csla.1994.1001