Data Science Capstone NLP Word Prediction

Farzaneh Bina

08/03/2024

Word Prediction using SwiftKey data

We constructed two algorithms that learn from the SwiftKey datasets (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) for English (US) and predict the next word given an input sentence. The complete dataset is about 556 MB in size and contains millions of lines of text. However, we built the application on a random sample of only 3% of that data, producing models that take about 6 seconds to load in a Shiny app.
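One way to reproduce the 3% random sampling step is the minimal base R sketch below. The file paths assume the unzipped en_US files from the SwiftKey download, and the seed and output file name are our own choices rather than part of the original app.

```r
# Sample roughly 3% of each en_US file at random (sketch, paths are assumptions)
set.seed(1234)                                   # reproducible sample
files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")

sample_lines <- function(path, rate = 0.03) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), 1, rate) == 1]     # keep ~3% of lines at random
}

corpus_sample <- unlist(lapply(files, sample_lines))
writeLines(corpus_sample, "sample_en_US.txt")    # persist the smaller training corpus
```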

Figure: most frequent words in the sampled corpus.

General Outline:

The algorithm accepts the word(s) \(w_1^{n-1} = w_1, w_2, \ldots, w_{n-1}\) and tokenizes the sentence into lexemes. We predict the next word by first considering the largest n-gram, then the (n-1)-gram, down to the (n-2)-gram. For each case we match the entered n-gram against those extracted from the corpus, and we select and output the top 3 words \(w_n\) with the highest probabilities \(P(w_n \mid w_1^{n-1})\). In the end, we return a list of at most 3 top word suggestions for each n-gram case, so an entered n-gram yields an n-sized list of top-frequency suggestions.
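As a rough illustration of the cleaning and tokenisation step, the following base R sketch lower-cases the input, strips punctuation and digits, and keeps at most the last three words; the helper name clean_input is ours, not the app's.

```r
clean_input <- function(text, n = 3) {
  text <- tolower(text)
  text <- gsub("[[:punct:][:digit:]]+", " ", text)  # strip punctuation and digits
  text <- gsub("\\s+", " ", trimws(text))           # collapse repeated whitespace
  tokens <- strsplit(text, " ", fixed = TRUE)[[1]]  # tokenize into lexemes
  tail(tokens, n)                                   # keep at most the last n words
}

clean_input("Hello, how are you")  # -> "how" "are" "you"
```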

Algorithms

The Exact Match with Backoff (EM) Algorithm

Given an n-gram \(g = w_1^{j} = (w_1, w_2, \ldots, w_j)\), obtain a word prediction \(h_k\) with \(k \leq j\):

\(h(g^k) = h_k = \arg\max_{w_{k+1}} P(w_{k+1} \mid w_1^{k})\)

using the following approach:

  1. Clean the input by removing punctuation, profanity, extra spaces and other unwanted elements from the word(s). Tokenize and construct an n-gram from the cleaned input. If there are more than 3 input words, take only the last 3.
  2. Predict the most likely word \(w_{k+1}\) with maximum probability \(P(w_{k+1} \mid w_1^{k}) = \frac{counts(w_1, w_2, \ldots, w_{k+1})}{counts(w_1, w_2, \ldots, w_k)}\).
  3. If fewer than three suggestions have been found, reduce the input \(g^k\) to the lower-order \((k-1)\)-gram \(g^{k-1}\) and repeat from step 2 (a minimal sketch of this backoff follows the list).
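The sketch below illustrates this exact-match backoff, assuming the n-gram counts have already been stored in data frames with columns prefix, word and count; the object and function names (ngram_counts, predict_em) are illustrative, not the app's actual code.

```r
predict_em <- function(tokens, ngram_counts, top = 3) {
  suggestions <- character(0)
  for (k in seq(min(length(tokens), 3), 1)) {       # back off from the largest n-gram
    prefix  <- paste(tail(tokens, k), collapse = " ")
    matches <- ngram_counts[[k + 1]]                # (k+1)-gram table for a k-word prefix
    matches <- matches[matches$prefix == prefix, ]
    if (nrow(matches) > 0) {
      # Ranking by count is equivalent to ranking by the ML probability
      # P(w_{k+1} | w_1^k) = counts(prefix, w) / counts(prefix)
      matches <- matches[order(-matches$count), ]
      suggestions <- union(suggestions, matches$word)
    }
    if (length(suggestions) >= top) break           # stop once enough suggestions found
  }
  head(suggestions, top)
}
```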

The Historical Track Algorithm

Predict \(h_k\) by finding unobserved lexemes relative to the input n-gram \(g^k\):

\(T(g^k) = \arg\max_{w_{k+1}} P(w_{k+1} \mid V_1^k)\), where \(V = W \cup Z\), \(W\) being the observed words and \(Z\) the unobserved words.

  1. We clean and tokenize the input and obtain an input n-gram predictor \(g^k\).
  2. Divide \(g^k = w_1^n\) into adjacent pairs \((w_{i-1}, w_i)\) according to the chain rule applied to the bigram model (see the sketch after this list): \(P(w_1^{n}) \approx P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2) \cdots P(w_n \mid w_{n-1})\).
  3. For each pair of words \((w_{i-1}, w_i) \in w_1^n\):
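As a partial illustration only, the sketch below shows step 2, splitting the cleaned input into the adjacent word pairs used by the bigram chain rule; the per-pair scoring over the vocabulary \(V = W \cup Z\) depends on the app's internals and is not reproduced here. The function name bigram_pairs is our own.

```r
bigram_pairs <- function(tokens) {
  if (length(tokens) < 2) return(list())
  Map(c, tokens[-length(tokens)], tokens[-1])   # adjacent pairs (w_{i-1}, w_i)
}

bigram_pairs(c("thanks", "for", "the"))
# -> pairs ("thanks", "for") and ("for", "the")
```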

Experimental Results: Accuracy

Experimental Results: Perplexity

Shiny App Online

The best results were obtained by combining the EM and HT algorithms.

Shiny App! For an entered phrase, the app provides up to three next-word suggestions for each n-gram case.

Shiny app available at https://frznhbn.shinyapps.io/ShinyApp/.