Next Word Prediction

Metin Turgal
18.04.2016

Introduction

This is the Capstone project for the Coursera Data Science Specialization.

This application predicts the next word of the text a user types, using a language model built from a sample of Twitter, blog and news data.

The Language Model

The steps to build the model are as follows:

Downloading the data
Preprocessing the data
- Removing profanity and punctuation
- Normalizing the format of the words
- Tokenization
Frequency Analysis
Building the Algorithm
- Frequency sorting + 'Stupid Backoff' smoothing
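
The preprocessing steps listed above can be sketched roughly as follows. This is only an illustration in Python, not the project's actual code: the function name `preprocess`, the placeholder `PROFANITY` set, and the choice to lowercase and keep only letters and apostrophes are all assumptions made for the example.

```python
import re

# Hypothetical placeholder list; the profanity list used in the project is not given.
PROFANITY = {"darn", "heck"}

def preprocess(text, profanity=PROFANITY):
    """Normalize case, strip punctuation and digits, tokenize, drop profane tokens."""
    # Assumed normalization: lowercase everything.
    text = text.lower()
    # Keep letters, apostrophes and whitespace; replace anything else with a space.
    text = re.sub(r"[^a-z'\s]", " ", text)
    # Tokenize on whitespace.
    tokens = text.split()
    # Remove tokens found in the profanity list.
    return [t for t in tokens if t not in profanity]
```

For example, `preprocess("Hello, World!")` yields the tokens `hello` and `world`, which can then be counted in the frequency analysis step.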

The Algorithm

The algorithm of the language model consists of the following steps:
1. Take manageable samples of the data.
2. Build 2-grams, 3-grams and 4-grams with their frequencies.
3. Sort them by frequency.
4. Match the input against the longest n-gram first; among matches of that length, the most frequent one gains priority.
5. If no match is found in the n-grams, the most frequent word is given as the prediction.
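
The steps above can be sketched as a small longest-match backoff predictor. This is a minimal Python illustration under stated assumptions, not the application's actual implementation: `build_model` and `predict_next` are hypothetical names, and the sketch picks the most frequent continuation at the longest matching context without applying the discount factor that Stupid Backoff scoring uses when falling back to shorter n-grams.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_n=4):
    """Count n-grams (2..max_n) as context -> Counter of next words, plus unigrams."""
    model = defaultdict(Counter)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])   # first n-1 words of the n-gram
            model[context][tokens[i + n - 1]] += 1  # last word is the continuation
    return model, Counter(tokens)

def predict_next(words, model, unigrams, max_n=4):
    """Try the longest context first, then back off to shorter ones."""
    for k in range(min(max_n - 1, len(words)), 0, -1):
        context = tuple(words[-k:])
        if context in model:
            # Most frequent continuation seen after this context.
            return model[context].most_common(1)[0][0]
    # No n-gram matched: fall back to the most frequent single word, if any.
    return unigrams.most_common(1)[0][0] if unigrams else None

# Usage on a toy corpus (illustrative data only).
tokens = "the cat sat on the mat the cat ran".split()
model, unigrams = build_model(tokens)
```

With this toy corpus, `predict_next(["on", "the"], model, unigrams)` matches the 3-gram "on the mat", while an unseen input such as `["zzz"]` falls through to the most frequent word.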

The Application Usage

The application interface is simple: the user enters text and, upon pressing the submit button, the application shows the next word predicted by the model.

Although the prediction can be based on 2-word, 3-word and 4-word groups, only the aggregated best result is shown, to keep the interface lean and simple. The application is largely self-explanatory, but instructions are also given in another tab of the application.