Word Predictor App - Coursera Data Science Capstone Project

Daniel Erzse
25th Apr 2015

Introduction

Word Predictor is a Shiny app built as part of the Coursera Data Science Capstone Project.

The application demonstrates a predictive algorithm: given a sentence entered by the user, it tries to predict the next word using Natural Language Processing techniques.

Summaries regarding

- Data acquisition and code preparation
- Prediction algorithm
- User interface

are presented in the following slides.

Data Acquisition and Code Preparation

  • Data acquisition. The raw data set consists of three English-language text files with a total size of 580 MB.
  • Exploratory analysis. This was performed to understand the word distribution, the number of unique words, usage frequency, and the top words used in each data source.
  • Sampling. A random sample of 5% of the corpus was used for training the predictive algorithm.
  • Data cleaning. Punctuation, extra white space, and non-printable characters were removed, and the text was converted to lowercase.
  • Tokenization. This was done using the RWeka package. Unigrams, bigrams, trigrams, and quadrigrams were generated and stored in four data tables.
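
The sampling, cleaning, and tokenization steps above can be sketched roughly as follows. This is a minimal Python illustration, not the R/RWeka code used for the app. The function names (`sample_lines`, `clean_text`, `count_ngrams`), the fixed random seed, and the exact cleaning rules are assumptions made for the example.

```python
import random
import re
import string
from collections import Counter


def sample_lines(lines, fraction=0.05, seed=42):
    """Keep each line with probability `fraction` (5% in the slides)."""
    rng = random.Random(seed)  # seed is an assumption, used for reproducibility
    return [line for line in lines if rng.random() < fraction]


def clean_text(text):
    """Lowercase, drop non-printable characters and punctuation, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def count_ngrams(lines, n):
    """Count every run of n consecutive words in the cleaned lines."""
    counts = Counter()
    for line in lines:
        tokens = clean_text(line).split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy corpus standing in for the sampled text files.
corpus = ["The cat sat on the mat.", "The cat ate!"]

# One frequency table per n-gram order (1 to 4), mirroring the four data tables.
tables = {n: count_ngrams(corpus, n) for n in range(1, 5)}
```

Keeping one table per n-gram order makes it straightforward to look up the longest available context first at prediction time.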

Prediction Algorithm

  • The algorithm calculates the probabilities for unigrams, bigrams and trigrams.
  • Smoothing is used to calculate probabilities for n-grams not found in the corpus.
  • The input text goes through similar steps: it is cleaned (numbers, extra spaces, and punctuation are removed) and converted to lowercase.
  • The last 3 words from the cleaned input text are used for prediction.
  • The algorithm searches for 3-grams first. If there is no result, it looks for 2-grams, and finally for 1-grams if there is no 2-gram result.
  • The top five results are returned by the algorithm.
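
The back-off search described above can be sketched as follows. This is a simplified Python illustration rather than the app's R implementation: it ranks candidates by raw counts and omits the smoothing step, and it assumes that an n-gram lookup uses the preceding n-1 words as context. The names `build_model` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict


def build_model(sentences, max_n=3):
    """Map each (n-1)-word context to a Counter of next words, for n = 1..max_n."""
    model = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                model[n][tuple(context)][word] += 1
    return model


def predict(model, text, top_k=5):
    """Try the highest-order n-grams first, backing off to shorter contexts."""
    tokens = text.lower().split()
    for n in sorted(model, reverse=True):
        # The context is the last n-1 input words (empty for unigrams).
        context = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # not enough input words for this order
        candidates = model[n].get(context)
        if candidates:
            return [word for word, _ in candidates.most_common(top_k)]
    return []


model = build_model([
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog sat on the rug",
])
print(predict(model, "the cat"))     # matched at the 3-gram level
print(predict(model, "purple cat"))  # backs off to 2-grams
print(predict(model, "zebra"))       # backs off to the most frequent 1-grams
```

Returning as soon as one order produces candidates mirrors the slide's description. A smoothed variant would instead combine or discount the probabilities across orders.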

User Interface

Word Predict App

- The app initializes by loading the model from the server.
- The app asks the user for an input sentence.
- The top five predicted words are generated based on the input sentence.