August 2015
Data Science Capstone
Johns Hopkins Bloomberg School of Public Health
Final Project Presentation
The aim of this kind of software is to predict the next word while the user is typing. It is frequently used on mobile devices to improve typing speed. Prediction is achieved by analyzing large sets of real text, for instance blogs, web pages, Twitter and a wide variety of other social network text. For this project, the text corpora contain data from blogs, news and Twitter.
Text data like this is usually processed with text mining and natural language processing techniques.
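As a rough illustration (in Python, not the app's actual code), the sketch below shows how raw text could be cleaned and tokenized into n-gram frequency tables; the function names and the sample file path are assumptions made for the example.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase, keep only letters and apostrophes, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def count_ngrams(lines, n):
    """Return a Counter of n-gram tuples over the cleaned lines."""
    counts = Counter()
    for line in lines:
        words = clean(line).split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Hypothetical usage: build the 1- to 4-gram tables from a sampled corpus file.
# with open("sample_corpus.txt") as f:
#     lines = f.readlines()
# tables = {n: count_ngrams(lines, n) for n in range(1, 5)}
```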
There are several algorithms for next word prediction (NWP). Due to shinyapps.io server space and resource restrictions, we use a simple backoff algorithm:
As the user types, the last N-1 words are taken and matched against the N-gram table, looking for the match with the highest frequency. If there is no match, the algorithm searches the (N-1)-gram table, then the (N-2)-gram table, and so on recursively until a result is found. If no result is obtained after searching all n-gram tables, the most "popular" word from the 1-gram table is returned. This implementation uses 4-, 3-, 2- and 1-grams built from samples of the original data; this reduction is needed to deploy the app on a free shinyapps.io account.
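A minimal sketch of this backoff lookup, again in Python for illustration rather than taken from the deployed app; `tables` is assumed to map each order n to a frequency table of n-gram tuples, as in the previous sketch.

```python
def predict_next_word(typed_text, tables, max_n=4):
    """Back off from the 4-gram table down to the 1-gram table, as described above."""
    # In the real app the input gets the same cleaning as the corpora;
    # a plain lowercase split keeps this sketch self-contained.
    words = typed_text.lower().split()
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # not enough context for this order, back off
        prefix = tuple(words[-(n - 1):])
        # Candidate n-grams whose first n-1 words match the typed prefix.
        candidates = {g: c for g, c in tables[n].items() if g[:-1] == prefix}
        if candidates:
            # Return the last word of the most frequent matching n-gram.
            return max(candidates, key=candidates.get)[-1]
    # No match at any order: return the most "popular" unigram.
    return tables[1].most_common(1)[0][0][0]
```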
The program could be improved by using training methods that capture relationships between non-consecutive words, and by building the n-grams from the full corpora.
Using the app is simple: just type your text into the text box (the same cleaning process applied to the corpora is applied to the input). The string used for the search is displayed, along with the predicted next word.
The app can be run here
I hope you like it!