Predict your next word: Coursera Data Science Capstone Project

Sora
July 8, 2016

What is this?

This is an algorithm that predicts the next word based on a user's text input.

It is the capstone project of the Coursera Data Science specialization by Johns Hopkins University. A Shiny application was built to test the accuracy of the algorithm.

The data is sampled from Twitter, news and blogs. Twitter and news data make up about 80% of the sample, because their language is closer to everyday conversation.

Methods and Models

The basic model is to build N-gram dictionaries. To do this:

Step 1: clean the sample (e.g. remove numbers and non-English words, convert to lowercase, etc.)

Step 2: break the sample into 1-, 2-, 3- and 4-gram sets

Step 3: build a function that takes the user's input and finds matches in the n-gram sets
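
The three steps can be sketched in a few lines. This is only an illustration in Python, not the code behind the app (which is a Shiny application); the function names `clean`, `build_ngrams` and `predict`, the tokenizing rule and the tiny sample text are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

def clean(text):
    """Step 1 (assumed rule): lowercase, then keep only tokens made of a-z."""
    tokens = re.findall(r"[^\W_]+", text.lower())  # runs of letters or digits
    return [t for t in tokens if re.fullmatch(r"[a-z]+", t)]

def build_ngrams(tokens, max_n=4):
    """Step 2: for n = 1..max_n, map each (n-1)-word prefix to next-word counts."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *prefix, last = tokens[i:i + n]
            tables[n][tuple(prefix)][last] += 1
    return tables

def predict(tables, text, n):
    """Step 3: match the last n-1 words of the input against the n-gram set."""
    prefix = tuple(clean(text)[-(n - 1):]) if n > 1 else ()
    followers = tables[n].get(prefix)
    return followers.most_common(1)[0][0] if followers else None

# Hypothetical sample text, standing in for the Twitter/news/blogs data.
sample = "The cat sat on the mat. The cat ran in 2016!"
tables = build_ngrams(clean(sample))
print(predict(tables, "on the", 3))   # most frequent word seen after "on the"
print(predict(tables, "zebra", 2))    # None: no match in the 2-gram set
```

Storing each n-gram set as a prefix-to-counts dictionary keeps the lookup in Step 3 to a single dictionary access; the real app may store its sets differently.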

Some Improvements

  1. When the algorithm cannot find the next word based on the input, it does a “downgrade” and searches for the word in the (n-1)-gram set.

  2. If a user types input that is not contained in the sample sets, the input is added to the corresponding n-gram dataset.
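
Both improvements can be sketched as follows. Again this is a hedged Python illustration rather than the app's actual code: the names `new_tables`, `add_tokens`, `predict_with_backoff` and `learn_if_unseen` are hypothetical, and two details are assumptions, namely that the downgrade continues all the way to the 1-gram set (returning the most frequent word) and that "corresponding" means an n-word input goes into the n-gram set.

```python
from collections import Counter, defaultdict

def new_tables(max_n=4):
    """One prefix -> next-word Counter table per n-gram size."""
    return {n: defaultdict(Counter) for n in range(1, max_n + 1)}

def add_tokens(tables, tokens):
    """Count every 1..max_n-gram that occurs in `tokens`."""
    for n in tables:
        for i in range(len(tokens) - n + 1):
            *prefix, last = tokens[i:i + n]
            tables[n][tuple(prefix)][last] += 1

def predict_with_backoff(tables, words):
    """Improvement 1: try the longest n-gram, then downgrade to (n-1)-grams."""
    for n in range(min(max(tables), len(words) + 1), 0, -1):
        prefix = tuple(words[len(words) - (n - 1):])  # last n-1 words
        followers = tables[n].get(prefix)
        if followers:
            return followers.most_common(1)[0][0]
    return None  # only reached when the tables are empty

def learn_if_unseen(tables, words):
    """Improvement 2 (assumed reading): add an unseen n-word input to the n-gram set."""
    n = len(words)
    if n not in tables:
        return False
    *prefix, last = words
    if last in tables[n].get(tuple(prefix), {}):
        return False  # already known, nothing to add
    tables[n][tuple(prefix)][last] += 1
    return True

tables = new_tables()
add_tokens(tables, "the cat sat on the mat the cat ran".split())
print(predict_with_backoff(tables, ["purple", "on", "the"]))  # falls back to the 3-gram set
learn_if_unseen(tables, ["zebra", "runs"])
print(predict_with_backoff(tables, ["zebra"]))                # now found in the 2-gram set
```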

Check out the app

Important: please give the app a few seconds to search.

How to use it:

First select the number of words that the input contains, then type in the input. The app returns a prediction of what the next word might be.

Link: https://yoke.shinyapps.io/dscapp/