Kiichi Takeuchi
April 26, 2015
Word prediction is one of the most important text mining techniques for speeding up user input. In a limited user interface environment, such as a smartphone application, predicting the next word is a key feature of the platform because keyboard size and functionality are limited; users benefit greatly from the saved keystrokes.
The goal of this project is to build an app that takes user input and displays a predicted word. Additionally, the user can select a specific data source from three different corpora (Blog, News and Twitter), and the other candidate words are displayed with their final scores.
Blog, News and Twitter files have been preprocessed as below:
Total file size: 47MB
The app calculates the probability of the next word from the preceding words, using the Markov assumption.
Four major options were examined for the smoothing algorithm:
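To make the Markov assumption concrete, here is a minimal Python sketch of a first-order (bigram) model: the next word is predicted from only the single preceding word, using plain relative frequencies. The function names and the toy corpus are hypothetical and chosen for illustration; they are not taken from the app itself.

```python
from collections import Counter, defaultdict


def build_bigram_counts(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Relative-frequency estimate of P(next | prev), i.e. a first-order Markov model."""
    followers = counts.get(prev.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # unseen context: no estimate without smoothing
    return {word: n / total for word, n in followers.items()}


# Toy corpus (illustrative only, not the Blog/News/Twitter data)
toy_corpus = [
    "thanks for following me",
    "thanks for reading this",
    "thanks for following back",
]
bigram_counts = build_bigram_counts(toy_corpus)
probs = next_word_probs(bigram_counts, "for")
```

An unseen context returns an empty result here, which is the gap that smoothing is meant to close.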
I decided to implement interpolation smoothing as the primary algorithm since it is fast to compute and gives reasonable results. Additionally, some weight from Kneser-Ney smoothing contributes to the lower-order N-grams.
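As a rough illustration of interpolation smoothing, the Python sketch below mixes unigram, bigram and trigram relative frequencies with fixed weights. The weights (0.1, 0.3, 0.6), the helper names and the toy text are assumptions made for this example; the app's actual weights are not stated here, and the Kneser-Ney contribution is left out.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def interpolated_prob(word, context, uni, bi, tri, lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation of unigram, bigram and trigram estimates.

    context is the tuple of the two preceding words (w1, w2).
    lambdas are illustrative weights that sum to 1 (an assumption).
    """
    l1, l2, l3 = lambdas
    w1, w2 = context
    total = sum(uni.values())
    p_uni = uni[(word,)] / total if total else 0.0
    p_bi = bi[(w2, word)] / uni[(w2,)] if uni[(w2,)] else 0.0
    p_tri = tri[(w1, w2, word)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


# Toy text (illustrative only)
tokens = ("thanks for following me and thanks for reading this "
          "and thanks for following back").split()
uni, bi, tri = (ngram_counts(tokens, n) for n in (1, 2, 3))

p_following = interpolated_prob("following", ("thanks", "for"), uni, bi, tri)
p_reading = interpolated_prob("reading", ("thanks", "for"), uni, bi, tri)
# Unseen trigram context: the lower orders still give a non-zero score
p_unseen = interpolated_prob("following", ("zzz", "for"), uni, bi, tri)
```

Because each order contributes separately, a context never seen as a trigram still gets a score from the bigram and unigram terms.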
https://kiichi.shinyapps.io/product/
Try “San”, “I love you so”, etc. Also type “Thanks for” and switch between Blog and Twitter. See how it picks up “following” vs. “reading” in the ranking?
A quick improvement could be made by using the current user's input history. In a Shiny app, user-specific session data is available, which could be used to store history per user. The user corpus should then add weight on top of the general frequency table.
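One possible way to layer a user corpus on top of the general table is sketched below in Python. It blends the general scores with the relative frequency of what this user has typed after the same word. The function name, the `boost` parameter and the blending rule are all assumptions for illustration, not the app's design.

```python
from collections import Counter


def rerank_with_history(general_scores, user_history, prev_word, boost=0.5):
    """Blend general scores with this user's own follow-up frequencies.

    general_scores: dict of candidate word -> score from the general corpus
    user_history:   list of words the user typed earlier in the session
    boost:          share of the final score given to the user corpus (assumed)
    """
    followers = Counter(
        nxt for prev, nxt in zip(user_history, user_history[1:]) if prev == prev_word
    )
    total = sum(followers.values())
    blended = {}
    for word in set(general_scores) | set(followers):
        user_p = followers[word] / total if total else 0.0
        blended[word] = (1 - boost) * general_scores.get(word, 0.0) + boost * user_p
    # Highest score first; ties broken alphabetically for stable output
    return sorted(blended.items(), key=lambda kv: (-kv[1], kv[0]))


general = {"following": 0.6, "reading": 0.3}  # made-up scores
history = "thanks for reading thanks for reading thanks for coming".split()

ranked = rerank_with_history(general, history, "for")
ranked_no_history = rerank_with_history(general, [], "for")
```

With this history, "reading" overtakes "following" for this user, while a user with no history keeps the general ranking.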
I would definitely consider implementing a Generalized Language Model (GLM), since it would be an add-on improvement built by creating another corpus. This could be done by replacing sparse words with
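Also, there is room to speed up the app and minimize the storage size. I can think of a better data structure, such as a trie, where looking up a context takes time proportional to the number of words in it rather than to the size of the frequency table.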
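A minimal Python sketch of the trie idea follows, assuming each path from the root spells out an N-gram and each node stores a count. The class and method names are hypothetical. Shared prefixes such as "thanks for" are stored once, which is where the storage saving would come from.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # word -> TrieNode
        self.count = 0      # how often the N-gram ending here was seen


class NGramTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, words, count=1):
        """Insert an N-gram (a sequence of words) with its frequency."""
        node = self.root
        for w in words:
            child = node.children.get(w)
            if child is None:
                child = TrieNode()
                node.children[w] = child
            node = child
        node.count += count

    def predict(self, context, k=3):
        """Return up to k most frequent next words after the context."""
        node = self.root
        for w in context:  # one dictionary lookup per context word
            node = node.children.get(w)
            if node is None:
                return []
        ranked = sorted(node.children.items(), key=lambda kv: (-kv[1].count, kv[0]))
        return [(w, child.count) for w, child in ranked[:k]]


trie = NGramTrie()
trie.add(["thanks", "for", "following"], 2)  # made-up counts
trie.add(["thanks", "for", "reading"], 1)
top = trie.predict(["thanks", "for"])
```

Walking the context costs one step per word, so lookup time does not depend on how many N-grams are stored overall.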