Data Science Capstone Project

Sunday, April 26, 2015

Introduction

As more and more human interaction takes place through typed text, there is a growing demand for tools that make typing easier. At the same time, computers keep gaining speed and memory, so vast amounts of text can be parsed quickly. Together, these trends form the perfect storm for a text prediction application.

This app, created for the Coursera Data Science Capstone project, lets a user enter some text and then predicts the next word. It helps by reducing the amount of typing required when communicating. I hope you enjoy it!

The Data and Preprocessing

The data for this project was provided by HC Corpora. There were three files: one with text from news sources, one with text from blogs, and one with text from Twitter (tweets).

Since the data files were rather large, only a random sample from each of the three files was used in developing the application. Once the data was loaded into R, the text was turned into words and then tokenized. After the three samples were concatenated into one, 2-, 3-, and 4-grams were created from the data. These are the first steps in creating the nextWord application.
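The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual R code. The function names, the sampling fraction, the seed, and the simple regular-expression tokenizer are all assumptions made for the example, and the toy corpus stands in for the sampled HC Corpora text.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction, seed=42):
    """Keep each line with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed scheme)."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(token_lists, n):
    """Count every run of n consecutive tokens across all token lists."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy stand-in for the combined news, blog, and Twitter samples.
corpus = ["I like green eggs", "I like green ham", "I like ham"]
token_lists = [tokenize(line) for line in corpus]

# One frequency table each for 2-grams, 3-grams, and 4-grams.
tables = {n: ngram_counts(token_lists, n) for n in (2, 3, 4)}
```

In this toy corpus the 2-gram ("i", "like") is counted three times, while the 4-gram ("i", "like", "green", "eggs") is counted once. Real data would need more careful cleaning (punctuation, numbers, profanity, and so on) than this sketch attempts.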

Using the App

The nextWord app is very simple to use. Once it opens, simply type some text into the text box. The predicted next word is then shown below. As you add or delete text, the prediction updates automatically. Very simple!

The Algorithm

Below are the steps of the nextWord predictive algorithm:

After the text is entered, the algorithm turns the text into words and determines how many words were entered.
The last word or words entered are grabbed and compared against the pre-made n-grams.
When the words are found in the respective n-gram, the most common next word is selected.
The selected word is displayed as the prediction!
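
The steps above might look something like the following sketch. It is written in Python for illustration and is not the app's actual R code. The names build_lookup and next_word are hypothetical, and the fallback from longer to shorter contexts when no match is found is an assumption, since the description does not say what happens in that case.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed scheme)."""
    return re.findall(r"[a-z']+", text.lower())


def build_lookup(sentences, max_n=4):
    """Map each context of 1 to max_n - 1 words to counts of the words that follow it."""
    lookup = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                lookup[context][tokens[i + n - 1]] += 1
    return lookup


def next_word(text, lookup, max_n=4):
    """Return the most common word seen after the last words of `text`, or None."""
    tokens = tokenize(text)
    # Try the longest usable context first, then shorter ones (assumed fallback).
    for k in range(min(len(tokens), max_n - 1), 0, -1):
        context = tuple(tokens[-k:])
        if context in lookup:
            return lookup[context].most_common(1)[0][0]
    return None


# Toy training text standing in for the sampled corpus.
training = [
    "thanks for the follow",
    "thanks for the follow",
    "thanks for the help",
    "see you at the party",
]
lookup = build_lookup(training)
prediction = next_word("Thanks for the", lookup)
```

Here prediction is "follow", because "follow" appears after "thanks for the" twice and "help" only once. When nothing in the input matches any stored context, this sketch returns None; a real app would likely fall back to a default word instead.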