Capstone Project: Word Prediction App

H. Kollera
2016-01-20

Motivation

The capstone project of the Coursera Data Science specialization deals with Natural Language Processing (NLP), a field with a broad range of applications such as information retrieval and speech recognition.

The aim of this project is to develop an exemplary, web-based word prediction app.

Data Source and Preparation

The training data set is based on texts from blogs, news and tweets, originally provided by HC Corpora. Because the goal is to predict the next word of a phrase, the preparation of the training set is split into three steps:

Basic cleaning: converting to lower case, removing punctuation and removing additional whitespace characters
Building n-grams for n = 1 to 4 and counting their frequencies
Removing unwanted n-grams, such as those with doubled words, numbers or profanities
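
The three steps above can be sketched as follows. This is a minimal illustration in Python, not the code used for the app; the function names, the placeholder profanity list and the exact cleaning rules (for example, keeping apostrophes) are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical placeholder; a real profanity list would be loaded from a file.
PROFANITIES = {"darn"}

def clean(text):
    """Step 1: lower-case, replace punctuation by spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # assumption: apostrophes are kept
    return re.sub(r"\s+", " ", text).strip()

def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unwanted(gram):
    """Step 3: flag n-grams with doubled words, numbers or profanities."""
    doubled = any(a == b for a, b in zip(gram, gram[1:]))
    has_number = any(any(ch.isdigit() for ch in word) for word in gram)
    profane = any(word in PROFANITIES for word in gram)
    return doubled or has_number or profane

def count_ngrams(lines, max_n=4):
    """Step 2: count n-gram frequencies for n = 1..max_n, skipping unwanted ones."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean(line).split()
        for n in counts:
            counts[n].update(g for g in ngrams(tokens, n) if not unwanted(g))
    return counts

counts = count_ngrams(["The cat sat, the cat ran!", "I have 2 cats cats"])
```

In this sketch the filtering happens while counting; filtering the finished frequency tables afterwards would give the same result.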

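The n-grams with the highest probability are collected in a probability table, which is used by a back-off algorithm: if the longest available context yields no match, the lookup falls back to the next shorter context.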
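
A lookup of this kind might look like the sketch below. It is a simple back-off without any discounting and is written in Python for illustration only. The table layout (a context tuple mapped to a list of candidate words sorted by descending probability), the function name and the sample probabilities are all assumptions.

```python
def predict(table, phrase, k=5, max_context=3):
    """Return up to k candidate next words, backing off to shorter contexts."""
    tokens = phrase.lower().split()
    results = []
    # Try the longest context first (3 words for 4-grams), then shorter ones;
    # n = 0 is the empty context, i.e. the most frequent single words.
    for n in range(min(max_context, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - n:])
        for word, _prob in table.get(context, []):
            if word not in results:
                results.append(word)
                if len(results) == k:
                    return results
    return results

# Made-up sample table; the probabilities are illustrative only.
table = {
    ("thanks", "for", "the"): [("follow", 0.5), ("help", 0.3)],
    ("for", "the"): [("first", 0.2), ("follow", 0.1)],
    ("the",): [("same", 0.1)],
    (): [("the", 0.05), ("to", 0.03), ("and", 0.02)],
}

suggestions = predict(table, "Thanks for the")
```

Here shorter contexts are only used to fill the remaining slots, so candidates from longer contexts always rank first.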

Shiny App

Based on the developed data model, an app was implemented with the Shiny toolkit.

Usage is as simple as the app itself: type or paste a phrase into the input text field. As soon as a word has been entered, the app suggests five candidates for the next word.

The app is hosted on the shiny server under WordPredictionTester

Possible Improvements

There are several ways to improve the quality of the n-grams, e.g.

Usage of dictionaries to use only real words
Usage of synonyms to extend the basis of n-grams
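
The dictionary idea could be sketched as a simple filter over the n-gram counts. This is a Python illustration under assumptions: the function name is hypothetical, and the word list shown is a tiny stand-in for a real dictionary.

```python
def filter_by_dictionary(counts, dictionary):
    """Keep only n-grams in which every word appears in the dictionary."""
    return {
        gram: count
        for gram, count in counts.items()
        if all(word in dictionary for word in gram)
    }

# Tiny stand-in for a real word list.
dictionary = {"see", "you", "soon"}
counts = {("see", "you"): 3, ("c", "u"): 7, ("you", "soon"): 2}
filtered = filter_by_dictionary(counts, dictionary)
```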

In addition, a self-learning component should be included. Integrating the user's input into the n-gram basis with a higher weight should improve the predictions, because they become more specific to the user's context.
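
One possible shape of such a component is sketched below in Python. The weight of 5 is an arbitrary assumption, as are the function name and the idea of adding the weight directly to the raw n-gram counts.

```python
from collections import Counter

USER_WEIGHT = 5  # hypothetical: one user n-gram counts like 5 corpus n-grams

def learn_from_input(counts, phrase, max_n=4, weight=USER_WEIGHT):
    """Add the n-grams of a user phrase to the counts with a higher weight."""
    tokens = phrase.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += weight
    return counts

counts = Counter({("see", "you"): 3})
learn_from_input(counts, "see you soon")
```

After an update of this kind the probability table would have to be recomputed from the counts before the new weights affect the predictions.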