FinalProject-DataScienceCapstone

Haowei Song
Nov. 25th 2017

Introduction

The objective of the Capstone project is to build an application that predicts the next word(s) from a partial sentence entered by the user. In this project, an n-gram model combining Good-Turing backoff with the Kneser-Ney algorithm was used to predict the next word from the preceding partial sentence.

Raw data files:

  • en_US.blogs.txt
  • en_US.news.txt
  • en_US.twitter.txt

Libraries used: plyr, data.table, tm, openNLP, reshape2

Getting & Cleaning the Data

  1. A subset of the original data was sampled from the three sources (blogs, twitter and news) and merged into one corpus.

  2. The data were cleaned by converting to lowercase, stripping white space, and removing punctuation and numbers.

  3. The corresponding n-grams (Quadgram, Trigram, Bigram and Unigram) were then created using the “RWeka” package.

  4. The term-count tables were extracted and sorted by frequency in descending order.
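
The cleaning and counting steps above can be sketched in a few lines. This is a minimal illustration in Python, not the project's R code (which used tm and RWeka); the function names `clean` and `ngram_counts` and the toy `sample` lines are hypothetical, and the cleaning rule assumes plain English text, so any character other than a-z or white space is dropped.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase, remove punctuation and numbers, and collapse white space."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)      # assumption: keep only a-z and white space
    return re.sub(r"\s+", " ", text).strip()

def ngram_counts(lines, n):
    """Count n-grams (tuples of n consecutive words) within each cleaned line."""
    counts = Counter()
    for line in lines:
        words = clean(line).split()
        # zip over n shifted copies of the word list yields every n-gram
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts

# Hypothetical toy lines standing in for the sampled blogs/news/twitter text.
sample = ["I love New York!", "I love 2 cats.", "New York is big"]
bigrams = ngram_counts(sample, 2)
top = bigrams.most_common(3)   # term-count table, most frequent first
```

Calling `ngram_counts(sample, n)` with n from 1 to 4 would give the unigram through quadgram tables in the same way.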

Here is the link to the milestone report:

http://rpubs.com/song_hw/324239

Word Prediction Model

Back-off algorithm:

In a backoff N-gram model, if the N-gram we need has zero counts, we approximate it by backing off to the (N-1)-gram. We continue backing off until we reach a history that has some counts.
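
The backing-off step can be illustrated as follows. This is a simplified sketch in Python under stated assumptions: it uses raw counts only and leaves out the Good-Turing discounting, and the names `build_tables` and `predict` and the toy sentences are hypothetical rather than taken from the project's code.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=4):
    """Map each history (tuple of preceding words) to a Counter of next words."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                history = tuple(words[i:i + n - 1])   # empty tuple for unigrams
                tables[history][words[i + n - 1]] += 1
    return tables

def predict(tables, text, k=3, max_n=4):
    """Return up to k next-word candidates, backing off to shorter histories."""
    words = text.lower().split()
    for n in range(max_n, 0, -1):
        history = tuple(words[-(n - 1):]) if n > 1 else ()
        # Use this order only if the input is long enough and the history was seen.
        if len(history) == n - 1 and history in tables:
            return [w for w, _ in tables[history].most_common(k)]
    return []

# Hypothetical toy corpus.
tables = build_tables(["i love new york", "i love new shoes",
                       "i love you", "new york is big"])
seen = predict(tables, "we love new")   # quadgram history unseen, trigram history found
unseen = predict(tables, "zzz")         # no history matches, falls back to unigrams
```

Here the three-word history "we love new" has no counts, so the lookup backs off to "love new"; a completely unknown history falls all the way back to the most frequent single words.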

Kneser-Ney algorithm:

  1. Create a unigram model based on continuation probability
  2. Estimate this model from the number of bigram types each word completes
  3. Use the discounted counts of the highest-order N-gram, interpolated with the lower-order estimates
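
For the bigram case, the steps above can be sketched as below. This is an illustrative Python version of interpolated Kneser-Ney with a single fixed discount (0.75 is an assumed value, not taken from the project); the function name `kneser_ney_bigram` and the toy counts are hypothetical.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(bigram_counts, discount=0.75):
    """Return p(word, prev): interpolated Kneser-Ney bigram probability."""
    history_total = Counter()         # how often prev starts a bigram
    followers = defaultdict(set)      # distinct words seen after prev
    predecessors = defaultdict(set)   # distinct words seen before word
    for (prev, word), c in bigram_counts.items():
        history_total[prev] += c
        followers[prev].add(word)
        predecessors[word].add(prev)
    n_types = len(bigram_counts)      # number of distinct bigram types

    def p_cont(word):
        # Continuation probability: share of bigram types that this word completes.
        return len(predecessors.get(word, ())) / n_types

    def p(word, prev):
        total = history_total.get(prev, 0)
        if total == 0:                # unseen history: use the continuation model only
            return p_cont(word)
        discounted = max(bigram_counts.get((prev, word), 0) - discount, 0) / total
        weight = discount * len(followers[prev]) / total   # mass freed by discounting
        return discounted + weight * p_cont(word)

    return p

# Hypothetical toy counts: "francisco" is frequent but only ever follows "san".
counts = Counter({("san", "francisco"): 5, ("new", "york"): 3, ("new", "glasses"): 1,
                  ("my", "glasses"): 1, ("reading", "glasses"): 1})
p = kneser_ney_bigram(counts)
vocab = {w for _, w in counts}
total_after_new = sum(p(w, "new") for w in vocab)   # should sum to 1 over the vocabulary
```

With these counts, "glasses" completes three bigram types and "francisco" only one, so after an unseen history the model prefers "glasses" even though "francisco" occurs more often.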

Shiny Application

A Shiny app was created. The app has a textInput box that lets users type in a partial sentence, and a verbatimTextOutput box that presents the top three predicted words when the “next word” action button is clicked.

Below is a screenshot of the Shiny app interface:

[Figure: screenshot of the Shiny app interface]