Next Word Prediction

Metin Turgal
18.04.2016

Introduction

This is the Capstone project for the Coursera Data Science Specialization.

This application predicts the next word of the text entered by the user, using a language model built on a sample of Twitter, blog and news data.

The Language Model

The steps to build the model are as follows:

Downloading the data
Preprocessing the data
- Removing profanity and punctuation
- Normalizing the format of the words
- Tokenization
Frequency Analysis
Building the Algorithm
- Frequency Sorting
- Stupid Backoff smoothing
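
The preprocessing and frequency analysis steps above could be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code: the names `preprocess` and `ngram_counts`, the placeholder `PROFANITY` set, and the choice to keep only letters and apostrophes are all assumptions made for the example.

```python
import re
from collections import Counter

# Placeholder word list (assumption); a real list would be loaded from a file.
PROFANITY = {"darn", "heck"}

def preprocess(text, profanity=PROFANITY):
    """Lowercase the text, strip punctuation, and drop profane tokens."""
    text = text.lower()
    # Assumption: keep letters, apostrophes and whitespace only;
    # every other character is replaced by a space.
    text = re.sub(r"[^a-z'\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in profanity]

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

if __name__ == "__main__":
    tokens = preprocess("Well, DARN it! I'm here.")
    print(tokens)
    # Most frequent bigrams first, as in the "Frequency Sorting" step.
    print(ngram_counts(tokens, 2).most_common())
```

`Counter.most_common()` returns the n-grams sorted by descending frequency, which covers the sorting step without a separate pass.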

The Algorithm

The algorithm of the language model consists of the following steps:
1. Take manageable samples of the data.
2. Build bigrams, trigrams and four-grams with their frequencies.
3. Sort them by frequency.
4. The match with the longest n-gram takes priority; among matches of the same length, the most frequent one is chosen.
5. If no n-gram matches, the most frequent word is given as the prediction.
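
A minimal sketch of these steps in Python is given below. It is an illustration under assumptions, not the application's implementation: `build_model` and `predict` are hypothetical names, and the code does plain longest-match backoff as listed in the steps, without the weighted scores that full Stupid Backoff assigns to lower-order n-grams.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_n=4):
    """Map each context (tuple of preceding words) to a Counter of next words.

    The empty context () holds plain word frequencies and serves as the
    final fallback.
    """
    model = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            model[tuple(context)][word] += 1
    return model

def predict(model, text, max_n=4):
    """Try the longest available context first, then back off to shorter ones."""
    words = text.lower().split()
    for k in range(max_n - 1, -1, -1):
        if k > len(words):
            continue  # not enough typed words for a context of this length
        context = tuple(words[len(words) - k:])
        followers = model.get(context)
        if followers:
            # Most frequent continuation for this context.
            return followers.most_common(1)[0][0]
    return None  # only reached when the model is empty

if __name__ == "__main__":
    corpus = "the cat sat on the mat the cat sat on the rug the cat ran".split()
    model = build_model(corpus)
    print(predict(model, "the cat"))   # matched through the trigram context
    print(predict(model, "a dog"))     # falls back to the most frequent word
```

Storing the counts by context keeps the lookup to one dictionary access per n-gram order, so a prediction needs at most four lookups.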

The Application Usage

The application interface is simple: the user enters text and, upon pressing the submit button, the application returns the next word predicted by the model.

Although the prediction could be shown separately for bigrams, trigrams and four-grams, a single prediction is given to keep the interface lean and simple.
Further instructions are given in a separate tab in the application.
To see the app:
Next Word Prediction App