Predict Next-Word App: Data Science Capstone Final Project

Manny Bernabe
04-26-2015

Introduction

  • The majority of the data created today is unstructured (e.g. text, log files). Although unstructured data can be more onerous to store and analyze than structured data, it can offer tremendous insights.

  • Text data is perhaps the most ubiquitous example of unstructured data in today’s digital economy.

  • In this presentation, we explore the use of text data to build a next-word prediction application.

Model Description

  • We built a next-word prediction application that uses large swaths of raw text data to infer the next word given a sequence of words.

  • For example, if a user were to type “I love”, the app might predict the next word to be “you”, given instances of “I love” followed by the word “you” within the raw data.

Model Algorithm

  • To determine predictions, the algorithm uses a simple back-off n-gram language model.

  • Using text analysis, several n-gram summary tables are computed from the raw text data.

  • The algorithm then takes a sequence of words given by a user and references the n-gram tables.

  • It looks for a match of the user's input text, starting with the highest-order n-gram table and falling back sequentially to lower-order n-gram tables if no match is found.
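
The back-off lookup described above can be illustrated with a minimal Python sketch. This is not the app's actual code: the function names (`build_ngram_tables`, `predict_next_word`), the toy corpus, the whitespace tokenization, and the choice of the single most frequent continuation are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_ngram_tables(tokens, max_n=3):
    """Build one table per n-gram order (hypothetical helper).

    Each table maps a context of n-1 words to a Counter of the words
    that followed that context in the token list.
    """
    tables = {}
    for n in range(2, max_n + 1):
        table = defaultdict(Counter)
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            table[context][next_word] += 1
        tables[n] = table
    return tables


def predict_next_word(tables, text, default=None):
    """Return the most frequent continuation, backing off to lower orders."""
    words = text.lower().split()
    # Start with the highest-order table and back off when no match is found.
    for n in sorted(tables, reverse=True):
        if len(words) < n - 1:
            continue  # not enough input words for this order
        context = tuple(words[-(n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return default


# Toy corpus standing in for the raw text data (assumption).
corpus = "i love you . i love you too . i love pizza . we love you"
tables = build_ngram_tables(corpus.split(), max_n=3)

print(predict_next_word(tables, "I love"))     # matched in the trigram table
print(predict_next_word(tables, "they love"))  # backs off to the bigram table
```

In this sketch, "they love" has no match in the trigram table, so the lookup falls back to the bigram table and uses only the last word, "love". A production model would typically also handle punctuation, unseen words, and smoothing, none of which is attempted here.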

App Usage

Check out the Shiny App. The interface is fairly straightforward. Please see the instructions below.