H. Kollera
2016-01-20
The capstone project of the Coursera Data Science specialization deals with Natural Language Processing (NLP), which has a broad range of applications, such as information retrieval and speech recognition.
The aim of this project is to develop an exemplary, web-based word prediction app.
The training data set is based on texts from blogs, news and tweets, originally provided by HC Corpora. Since the goal is to predict the next word of a phrase, the training set is prepared in three steps:
The n-grams with the highest probability are collected in a probability table, which is used for a back-off algorithm.
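The idea of a probability table combined with back-off can be sketched in a few lines. The following is only a minimal illustration in Python, not the project's actual implementation: the function names `build_table` and `predict`, the maximum n-gram order of 3, and the simple "fall back to the next shorter context" rule (without any discounting) are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_table(tokens, max_n=3, keep=5):
    """Map each context (a tuple of 0 to max_n-1 words) to its `keep`
    most frequent next words with their relative frequencies."""
    counts = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            counts[tuple(context)][word] += 1
    table = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        table[context] = [(w, c / total) for w, c in counter.most_common(keep)]
    return table


def predict(table, phrase, max_n=3, k=5):
    """Suggest up to k next words, trying the longest context first and
    backing off to shorter contexts until k suggestions are found."""
    words = phrase.lower().split()
    suggestions = []
    longest = min(max_n - 1, len(words))
    for size in range(longest, -1, -1):
        context = tuple(words[len(words) - size:])
        for word, _prob in table.get(context, []):
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    return suggestions


# Toy corpus, purely illustrative.
corpus = "the cat sat on the mat the cat ate the fish".split()
table = build_table(corpus)
print(predict(table, "on the"))
```

Keeping only the most probable continuations per context keeps the table small, which matters for a web app; the price is that rare continuations can only be reached through the shorter contexts.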
Based on the developed data model, an app was implemented with the Shiny toolkit.
Usage is as simple as the app itself: type or paste a phrase into the input text field. As soon as a word has been entered, the app suggests five candidates for the next word.
The app is hosted on the Shiny server under WordPredictionTester.
There are several possibilities to improve the quality of the n-grams, e.g.
In addition, a self-learning component should be included. Integrating the user's input into the n-gram base with a higher weight would improve the prediction, because the suggestions become more specific to the user's context.
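How such a self-learning component could weight user input is sketched below. This is a hypothetical illustration in Python, not part of the described app: the names `add_counts` and `top_words` and the weight of 5 for user input are assumptions chosen for the example.

```python
from collections import Counter, defaultdict

USER_WEIGHT = 5  # assumed value: one user n-gram counts like five corpus n-grams


def add_counts(counts, tokens, weight=1, max_n=3):
    """Add all n-grams of `tokens` (up to order max_n) to `counts`,
    each occurrence contributing `weight` to its context's counter."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            counts[tuple(context)][word] += weight
    return counts


def top_words(counts, context, k=5):
    """Return the k most heavily weighted next words for a context."""
    counter = counts.get(tuple(context), Counter())
    return [word for word, _ in counter.most_common(k)]


counts = defaultdict(Counter)
# Base corpus with the default weight of 1.
add_counts(counts, "i like tea and i like coffee and i like tea".split())
print(top_words(counts, ("i", "like")))

# Text typed by the user is added with the higher weight.
add_counts(counts, "i like cocoa".split(), weight=USER_WEIGHT)
print(top_words(counts, ("i", "like")))
```

With a weight above 1, a phrase the user has typed once can already outrank continuations that are more frequent in the general corpus. How large the weight should be is a tuning question that the sketch does not answer.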