Coursera Data Science Capstone Project: How a trigram model predicts the next word for two-word phrases

Kim Stacks (SIM KIM SIA)
14th Dec 2014


About this project

Main aim of the Data Science Capstone Project:

Produce a predictive text algorithm in R that, based on a user's text input, suggests the next most likely word to be entered. The project covers:

  • Getting and Cleaning Data
  • Reproducible Report
  • Prediction Algorithm
  • ShinyApps
  • R Presentation

Description of the algorithm used to make the prediction

  1. Capture the first letter of the first word of the phrase.
  2. If the input is a two-word phrase, use the 3-gram (trigram) model for the search.
  3. Load the 3-gram database that matches the first letter captured in step 1 and look up the next word.
  4. If a word found is in the bad-word list, remove it from the possible next-word predictions.
  5. Repeated words are also removed from the result.
  6. Repeat until there are at least 3 possible words that could come after the two-word phrase.
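
The steps above can be sketched as follows. This is a minimal Python illustration, not the project's R code: the function names (build_trigram_shards, filter_candidates, predict_next), the in-memory dictionary standing in for the per-letter 3-gram databases, the placeholder bad-word list, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical placeholder; the real bad-word list is not given in the source.
BAD_WORDS = {"darn"}


def build_trigram_shards(sentences):
    """Count trigrams, grouped by the first letter of the first word.

    Each first letter maps to its own table, standing in for the
    separate 3-gram databases described in the steps above.
    """
    shards = defaultdict(lambda: defaultdict(Counter))
    for sentence in sentences:
        words = sentence.lower().split()
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            shards[w1[0]][(w1, w2)][w3] += 1
    return shards


def filter_candidates(candidates, bad_words, k):
    """Keep candidates in order, dropping bad words and repeats, up to k."""
    result = []
    for word in candidates:
        if word in bad_words or word in result:
            continue
        result.append(word)
        if len(result) == k:
            break
    return result


def predict_next(phrase, shards, bad_words=BAD_WORDS, k=3):
    """Return up to k likely next words for a two-word phrase."""
    words = phrase.lower().split()
    if len(words) != 2:
        raise ValueError("expected a two-word phrase")
    first_letter = words[0][0]                # step 1: first letter of first word
    shard = shards.get(first_letter, {})      # step 3: pick the matching table
    counts = shard.get(tuple(words), Counter())
    ranked = [word for word, _ in counts.most_common()]  # most frequent first
    return filter_candidates(ranked, bad_words, k)       # steps 4 and 5


# Toy corpus, invented for the example.
corpus = (
    ["thank you very much"] * 3
    + ["thank you so much"] * 2
    + ["thank you for coming"]
    + ["thank you darn it"] * 4
)
shards = build_trigram_shards(corpus)
print(predict_next("thank you", shards))  # ['very', 'so', 'for']
```

In this sketch "darn" is the most frequent continuation but is dropped by the bad-word filter, so the next three words by frequency are returned instead.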

Description of the search algorithm when the desired sequence cannot be found

  1. The last word typed is divided in half, and a similar word that shares the first half of its characters is used for the search.

  2. If step 1 still returns no result, the last word typed is divided in half, and a similar word that shares the second half of its characters is used for the search.

  3. If steps 1 and 2 still return no result, only the last 3 characters of the last word are used, and all databases are searched for words with the same last characters.
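
One possible reading of this fallback is sketched below in Python. It is an illustration under stated assumptions, not the project's implementation: "similar word" is interpreted as a prefix match for the first half and a suffix match for the second half, the function name find_substitutes and its arguments shard_words (words in the currently loaded database) and all_words (words across all databases) are hypothetical, and results are sorted alphabetically only to make the output deterministic.

```python
def find_substitutes(last_word, shard_words, all_words):
    """Find stand-in words for last_word using the three fallback stages."""
    if not last_word:
        return []
    half = len(last_word) // 2
    first_half, second_half = last_word[:half], last_word[half:]

    # Stage 1: words in the loaded database that start with the first half.
    if first_half:
        matches = sorted(
            w for w in shard_words
            if w != last_word and w.startswith(first_half)
        )
        if matches:
            return matches

    # Stage 2: words in the loaded database that end with the second half.
    matches = sorted(
        w for w in shard_words
        if w != last_word and w.endswith(second_half)
    )
    if matches:
        return matches

    # Stage 3: words in any database that share the last 3 characters.
    tail = last_word[-3:]
    return sorted(w for w in all_words if w != last_word and w.endswith(tail))


# Toy vocabularies, invented for the example.
shard_words = {"morning", "money", "evening"}
all_words = shard_words | {"running", "sing"}

print(find_substitutes("mornign", shard_words, all_words))   # ['morning']
print(find_substitutes("xxening", shard_words, all_words))   # ['evening', 'morning']
print(find_substitutes("zzzzqing", shard_words, all_words))  # ['evening', 'morning', 'running', 'sing']
```

Each stage runs only when the previous one finds nothing, so the broadest search (last 3 characters across all databases) is the last resort.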

Description of app

The app accepts only two-word phrases and attempts to return a minimum of 3 words that could be the next word of the phrase.

Bad words are removed.

Instructions and How it works

Enter any two-word phrase found in the Twitter or news en_US text files.

You should get an output of 3 words, sorted by relevance.

Do this 5 times.

You should get different results for each attempt.

Links