Haowei Song
Nov. 25th 2017
The objective of the Capstone project is to build an application that predicts the next word(s) from a partial sentence entered by the user. In this project, an n-gram model combined with both a Good-Turing back-off and the Kneser-Ney algorithm was used to predict the next word based on the preceding partial sentence.
Raw data files:
Libraries used: plyr, data.table, tm, openNLP, reshape2
A subset of the original data was sampled from the three sources (blogs, Twitter and news) and merged into one corpus.
Data cleaning was done by converting to lowercase, stripping white space, and removing punctuation and numbers.
The corresponding n-grams (quadgrams, trigrams, bigrams and unigrams) were then created using the “RWeka” package.
The term-count tables were extracted, sorted, and presented in descending order of frequency.
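As a rough illustration of these preprocessing steps, here is a minimal R sketch. The file names, the sample fraction, and the helper name ngram_freq are assumptions for illustration, not the exact code used in the project.

    # Minimal sketch of the sampling, cleaning and n-gram steps described above
    library(tm)
    library(RWeka)
    library(data.table)

    set.seed(1234)

    # Sample a small fraction of each source and merge into one corpus
    sample_lines <- function(path, frac = 0.01) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      sample(lines, floor(length(lines) * frac))
    }
    merged <- c(sample_lines("en_US.blogs.txt"),
                sample_lines("en_US.twitter.txt"),
                sample_lines("en_US.news.txt"))
    corpus <- VCorpus(VectorSource(merged))

    # Clean: lowercase, strip white space, remove punctuation and numbers
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, stripWhitespace)
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)

    # Tokenize into n-grams with RWeka and build sorted term-count tables
    ngram_freq <- function(corpus, n) {
      tok <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
      tdm <- TermDocumentMatrix(corpus, control = list(tokenize = tok))
      counts <- sort(slam::row_sums(tdm), decreasing = TRUE)
      data.table(term = names(counts), count = as.integer(counts))
    }
    unigrams  <- ngram_freq(corpus, 1)
    bigrams   <- ngram_freq(corpus, 2)
    trigrams  <- ngram_freq(corpus, 3)
    quadgrams <- ngram_freq(corpus, 4)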
Here is the link to the milestone report:
Back-off algorithm:
Briefly, if the N-gram we need has zero counts, we approximate it by backing off to the (N-1)-gram. We continue backing off until we reach a history that has some counts.
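The sketch below shows only this back-off lookup over the n-gram tables built above (with a "term" column of space-separated words and a "count" column); the Good-Turing discounting of the counts is omitted for brevity, and the function name predict_backoff is an illustrative assumption.

    # Back off from quadgrams to trigrams to bigrams; fall back to unigrams
    library(data.table)

    predict_backoff <- function(phrase, top_n = 3) {
      words  <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 3)
      tables <- list(quadgrams, trigrams, bigrams)           # highest order first
      for (i in seq_along(tables)) {
        history <- paste(tail(words, 4 - i), collapse = " ") # shrink the history
        if (history == "") next
        hits <- tables[[i]][startsWith(term, paste0(history, " "))]
        if (nrow(hits) > 0) {
          # Return the last word of the top matching n-grams
          preds <- sapply(strsplit(hits$term, " "), tail, 1)
          return(head(unique(preds), top_n))
        }
      }
      # No history matched: return the most frequent unigrams
      head(unigrams$term, top_n)
    }

    predict_backoff("thanks for the")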
Kneser-Ney algorithm:
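Kneser-Ney smoothing discounts the raw n-gram counts and redistributes the reserved mass to a "continuation" probability that measures how many distinct contexts a word completes, rather than how often it occurs overall. For reference (this is the standard interpolated bigram form with discount d, not taken from the original write-up):

    P_KN(w_i | w_{i-1}) = max(c(w_{i-1} w_i) - d, 0) / c(w_{i-1}) + \lambda(w_{i-1}) * P_cont(w_i)

    where \lambda(w_{i-1}) = (d / c(w_{i-1})) * |{w : c(w_{i-1} w) > 0}|
    and   P_cont(w_i)     = |{w' : c(w' w_i) > 0}| / |{(w', w) : c(w' w) > 0}|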
A Shiny app was created. The app has a textInput box that allows users to type in a partial sentence, and a verbatimTextOutput box that presents the top three predicted words when the “next word” action button is clicked.
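A minimal sketch of such a Shiny UI is shown below; the input/output IDs and the call to the predict_backoff sketch above are illustrative assumptions, not the app's actual source.

    library(shiny)

    ui <- fluidPage(
      titlePanel("Next Word Prediction"),
      textInput("phrase", "Enter a partial sentence:"),
      actionButton("go", "next word"),
      verbatimTextOutput("prediction")
    )

    server <- function(input, output) {
      output$prediction <- renderText({
        input$go                             # re-run only when the button is clicked
        isolate(paste(predict_backoff(input$phrase, top_n = 3), collapse = ", "))
      })
    }

    shinyApp(ui = ui, server = server)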
Below is a screenshot of the Shiny app interface: