Farzaneh Bina
08/03/2024
We constructed two algorithms that learn from the SwiftKey datasets (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) for the English (US) language and predict the next word given an input sentence. The complete dataset was about 556 MB in size and contained several million lines of text. However, we sampled only 3% of that data at random and built models that take about 6 seconds to load in a Shiny app.
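As a rough sketch of that sampling step (the file names, seed, and rbinom-based selection below are our assumptions, not the project's exact code), a random 3% of the lines can be drawn from the three English (US) files like this:

```r
library(readr)

set.seed(1234)  # hypothetical seed, only so the sample is reproducible
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

sample_lines <- function(path, rate = 0.03) {
  lines <- read_lines(path)
  # keep each line independently with probability `rate` (~3% of the corpus)
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]
}

sampled <- unlist(lapply(files, sample_lines))
write_lines(sampled, "sample_en_US.txt")
```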
The algorithm accepts the word(s) \(w_1^{n-1}=w_1,w_2,\ldots,w_{n-1}\), then tokenizes the sentence into lexemes. We aim to predict the next word by first considering the largest n-gram, then the (n-1)-gram, down to the (n-2)-gram. For each case we match the entered n-gram against those extracted from the corpus. We select and output the top 3 words \(w_n\) with the highest probabilities \(P(w_n \mid w_1^{n-1})\). In the end, we return a list of at most 3 top word suggestions for each n-gram case. Therefore, an entered n-gram yields an n-sized list of top-frequency suggestions.
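The lookup itself can be sketched as a simple back-off over pre-built frequency tables; everything below (the object names and the `prefix`/`word`/`prob` columns) is a hypothetical layout, not the app's actual code:

```r
library(dplyr)
library(stringr)

# ngram_tables: hypothetical list of data frames (e.g. 4-gram, 3-gram, 2-gram),
# each with columns `prefix`, `word`, `prob`, ordered from largest to smallest n.
predict_next <- function(sentence, ngram_tables, top_k = 3) {
  tokens <- str_split(str_to_lower(sentence), "\\s+")[[1]]
  lapply(ngram_tables, function(tbl) {
    k <- str_count(tbl$prefix[1], "\\s+") + 1   # number of words in this table's prefix
    key <- paste(tail(tokens, k), collapse = " ")
    tbl %>%
      filter(prefix == key) %>%                 # match the entered n-gram prefix
      arrange(desc(prob)) %>%
      slice_head(n = top_k) %>%
      pull(word)                                # top candidate next words
  })
}
```

Each element of the returned list corresponds to one n-gram order, mirroring the per-order suggestion lists described above.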
Given an n-gram \(g=w_1^{j}=(w_1,w_2,\ldots,w_j)\), we obtain a word prediction \(h_k\) for each \(k \leq j\):
\(h(g^k) = h_k = \operatorname{argmax}_{w_{k+1}} P(w_{k+1} \mid w_1^{k})\)
using the following approach:
Predict \(h_k\) by also handling unobserved lexemes in the input n-gram \(g^k\):
\(T(g^k) = \operatorname{argmax}_{w_{k+1}} P(w_{k+1} \mid V_1^{k})\), where \(V = W \cup Z\), with \(W\) the set of observed words and \(Z\) the set of unobserved words.
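The text does not spell out how probability mass is assigned to the unobserved words in \(Z\); as one hedged illustration (a simple back-off score in the spirit of Stupid Backoff, not necessarily the method used here), candidates could be ranked by

\[
S(w_{k+1}\mid w_1^{k}) =
\begin{cases}
\dfrac{c(w_1^{k+1})}{c(w_1^{k})} & \text{if } c(w_1^{k+1}) > 0,\\[1ex]
\alpha\, S(w_{k+1}\mid w_2^{k}) & \text{otherwise,}
\end{cases}
\]

where \(c(\cdot)\) counts occurrences in the sampled corpus and \(\alpha < 1\) penalizes backing off to a shorter context, so words unobserved after \(w_1^{k}\) are still ranked through progressively shorter n-grams.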
The best results were obtained using the EM and HT algorithms.
The App
The app provides next-word suggestions for an input phrase, as described above. The Shiny app is available at https://frznhbn.shinyapps.io/ShinyApp/.