Word Predictor App - Coursera Data Science Capstone Project

Daniel Erzse
25th Apr 2015

Introduction

Word Predictor is a Shiny app built as part of the Coursera Data Science Capstone Project.

The application demonstrates a predictive algorithm in practice. Given a sentence entered by the user, the app tries to predict the next word using Natural Language Processing techniques.

Summaries regarding

- Data acquisition and code preparation

- Prediction algorithm

- User interface

are presented in the following slides.

Data Acquisition and Code Preparation

Data acquisition. The raw data set consists of three English-language text files with a total size of 580 MB.
Exploratory analysis. The analysis examined the word distribution, the number of unique words, usage frequency, and the most frequent words in each data source.
Sampling. A random sample of 5% of the corpus was used to train the predictive algorithm.
Data cleaning. Punctuation, extra white space, and non-printable characters were removed, and the text was converted to lowercase.
Tokenization. Tokenization was done with the RWeka package. Unigrams, bigrams, trigrams, and quadrigrams were generated and stored in four data tables.
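The sampling, cleaning, and tokenization steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the app's actual R/RWeka code: the function names, the fixed random seed, the tiny example corpus, and the exact cleaning rules (for instance, how apostrophes are handled) are all assumptions made for the example.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction=0.05, seed=42):
    """Keep each line with probability `fraction` (the seed is an assumption)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def clean_text(text):
    """Lowercase, drop non-printable characters and punctuation, collapse spaces."""
    text = text.lower()
    # Keep printable characters and whitespace; drop control characters.
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    # Remove punctuation (anything that is not a word character or whitespace).
    text = re.sub(r"[^\w\s]|_", "", text)
    return re.sub(r"\s+", " ", text).strip()


def ngram_counts(lines, n):
    """Count n-grams of order n; n-grams never span two lines."""
    counts = Counter()
    for line in lines:
        tokens = clean_text(line).split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical three-line corpus standing in for the sampled text.
corpus = ["Thanks for the follow!", "Thanks for the great day.", "See you soon"]

# One frequency table per n-gram order, from unigrams to four-grams.
tables = {n: ngram_counts(corpus, n) for n in range(1, 5)}
```

Here `tables[3][("thanks", "for", "the")]` is 2, because that trigram occurs in two of the three example lines. Keeping one table per order mirrors the four data tables described above.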

Prediction Algorithm

The algorithm calculates the probabilities for unigrams, bigrams and trigrams.
Smoothing is used to calculate probabilities for n-grams not found in the corpus.
The input text goes through similar steps: numbers, extra spaces, and punctuation are removed, and the text is converted to lowercase.
The last 3 words from the cleaned input text are used for prediction.
The algorithm searches for matching 3-grams first. If none are found, it falls back to 2-grams, and then to 1-grams.
The top five results are returned by the algorithm.
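The lookup order described above can be sketched as a simple back-off over n-gram counts. This is an illustrative Python sketch, not the app's R implementation. It assumes that a 3-gram lookup uses the last two words as context, it ranks candidates by raw counts, and it leaves out the smoothing step, since the slides do not say which method was used. `build_model` and `predict_next` are hypothetical names.

```python
from collections import Counter, defaultdict


def build_model(sentences, max_order=3):
    """Map each context (tuple of preceding words) to a Counter of next words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for order in range(1, max_order + 1):
            for i in range(len(tokens) - order + 1):
                *context, word = tokens[i:i + order]
                model[tuple(context)][word] += 1
    return model


def predict_next(model, text, max_order=3, top_n=5):
    """Try the longest context first, then back off to shorter ones."""
    tokens = text.lower().split()
    for context_len in range(max_order - 1, -1, -1):
        if context_len > len(tokens):
            continue  # not enough input words for this order
        context = tuple(tokens[len(tokens) - context_len:])
        candidates = model.get(context)
        if candidates:
            return [word for word, _ in candidates.most_common(top_n)]
    return []


# Hypothetical training sentences, already cleaned.
model = build_model([
    "i love new york",
    "i love new shoes",
    "i love new york city",
    "we love pizza",
])

print(predict_next(model, "I love new"))  # 3-gram match: ['york', 'shoes']
print(predict_next(model, "they love"))   # backs off to 2-grams: ['new', 'pizza']
```

An empty context (length 0) corresponds to the unigram table, so an input with no known words still returns the most frequent words overall.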

User Interface

- The app initializes by loading the model from the server.

- The app asks the user for an input sentence.

- The top five predicted words are generated from the input sentence.