Data science capstone project: next word prediction

jeffzfw
20-8-2015

Overview

This application is the capstone project for the Coursera Data Science specialization, taught by professors at Johns Hopkins University in cooperation with SwiftKey.

The main goal of this capstone project is to build a Shiny application that predicts the next word from the text a user has typed.

Tasks of the Project

The project covers the following tasks:

  • Data acquisition and cleaning
  • Exploratory analysis
  • Statistical modeling
  • Predictive modeling
  • Creative exploration
  • Creating a data product
  • Making an interactive slide deck

Project procedure:

  • Read in the data and run a basic analysis: load the Twitter, news, and blog data.

  • Build a corpus, clean it, tokenize it, then create a term-document matrix (TDM) for each n-gram order (unigram, 2-gram, 3-gram, and 4-gram).

  • Convert each TDM to a data frame, one per n-gram order (each data frame contains the term, its count, and the probability that it occurs in the data frame).

  • Create an application that runs the prediction and returns the word most likely to come next.
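The table-building steps above can be sketched roughly as follows. This is a minimal Python illustration, not the app's actual code (which is built with R and Shiny). The function names `tokenize` and `ngram_table`, the cleaning rule (lowercase, keep only letters and apostrophes), and the definition of probability as count divided by the total number of n-grams of that order are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_table(documents, n):
    """Return rows of (term, count, probability), most frequent first.

    Probability here is the relative frequency of the term among all
    n-grams of order n (an assumption; the source does not define it).
    """
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    total = sum(counts.values())
    return [(term, c, c / total) for term, c in counts.most_common()]


# Toy input standing in for the Twitter, news, and blog data.
sample_docs = ["I love data science", "I love New York"]
bigrams = ngram_table(sample_docs, 2)
```

Calling `ngram_table` once per order (1 through 4) would give one table per n-gram type, mirroring the one-data-frame-per-gram layout described above.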

Algorithm

  • 2-gram: takes the last word of the input and searches the bigram data frame for possible next terms.
  • 3-gram: takes the last two words and searches the trigram data frame.
  • 4-gram: takes the last three words and searches the 4-gram data frame.
  • Katz's back-off model: estimates the next word by “backing off” to models with shorter histories under certain conditions, for example when the longer history has no match.
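The lookup order described above can be illustrated with a simplified back-off in Python. This is a sketch only: it tries the longest history first and falls back to shorter ones, but it omits the discounting and back-off weights that a full Katz model uses. The function name `predict_next`, the nested-dictionary layout of `tables`, and the toy counts are hypothetical and do not come from the app.

```python
def predict_next(text, tables):
    """Return (predicted_word, model_used) using a simplified back-off.

    tables maps an n-gram order (2, 3, 4) to {history: {next_word: count}}.
    """
    words = text.lower().split()
    for n in (4, 3, 2):  # longest history first
        if len(words) < n - 1:
            continue  # not enough words typed for this order
        history = " ".join(words[-(n - 1):])
        candidates = tables.get(n, {}).get(history)
        if candidates:
            best = max(candidates, key=candidates.get)
            return best, f"{n}-gram"
    return None, "no match"


# Toy counts, invented for illustration only.
tables = {
    4: {"thanks for the": {"follow": 3, "help": 1}},
    3: {"for the": {"first": 5}, "one of": {"the": 4}},
    2: {"the": {"same": 2}, "of": {"the": 7}},
}

print(predict_next("Thanks for the", tables))  # matched in the 4-gram table
print(predict_next("is one of", tables))       # backs off to the 3-gram table
print(predict_next("see the", tables))         # backs off to the 2-gram table
```

Returning the model name alongside the word matches the app's behavior of showing which algorithm produced the prediction.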

How it works

Type more than one word into the text input box under “Input words please:”, then click Submit or press the Enter key. The next-word prediction appears in the main panel on the right, followed by the algorithm used to make the prediction. (click to app)