FinalProject-DataScienceCapstone

Haowei Song
Nov. 25th 2017

Introduction

The objective of the Capstone project is to build an application that predicts the next word(s) from a partial sentence entered by the user. In this project, an n-gram model combining Good-Turing back-off and the Kneser-Ney algorithm was used to predict the next word from the preceding partial sentence.

Raw data files:

en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt

Libraries used:

plyr
data.table
tm
openNLP
reshape2

Getting & Cleaning the Data

A subset of the original data was sampled from the three sources (blogs, Twitter and news) and merged into one corpus.
The data were cleaned by converting the text to lowercase, stripping extra white space, and removing punctuation and numbers.
The corresponding n-grams (Quadgram, Trigram, Bigram and Unigram) were then created using the “RWeka” package.
The term-count tables were extracted and sorted by frequency in descending order.
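
The cleaning and counting steps above can be sketched as follows. This is a minimal illustration written in Python, not the project's R code (which relied on tm and RWeka). The function names `clean_text` and `ngram_counts` and the sample sentences are hypothetical, and the regular expression assumes plain English text.

```python
import re
from collections import Counter

def clean_text(text):
    """Lowercase, remove punctuation and digits, collapse white space."""
    text = text.lower()
    # Drop everything that is not a letter or white space (assumption: English text).
    text = re.sub(r"[^a-z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def ngram_counts(lines, n):
    """Count n-grams (stored as tuples of words) over the cleaned lines."""
    counts = Counter()
    for line in lines:
        words = clean_text(line).split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Hypothetical sample standing in for the sampled blogs/news/Twitter corpus.
sample = ["I love New York!", "I love 2 cats.", "New York is big"]
bigrams = ngram_counts(sample, 2)
# most_common() returns (n-gram, count) pairs in descending order of frequency.
top_bigrams = bigrams.most_common(3)
```

Calling `ngram_counts(sample, n)` with n from 1 to 4 would give the Unigram to Quadgram tables used later for prediction.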

Here is the link to the milestone report: http://rpubs.com/song_hw/324239

Word Prediction Model

Back-off algorithm: In a back-off N-gram model, if the N-gram we need has zero counts, we approximate it by backing off to the (N-1)-gram. We continue backing off until we reach a history that has some counts. In this project:

The Quadgram table is searched first (for Quadgrams whose first three words are the last three words of the user-provided sentence).
If no Quadgram is found, back off to the Trigram table (Trigrams whose first two words are the last two words of the sentence).
If no Trigram is found, back off to the Bigram table (Bigrams whose first word is the last word of the sentence).
If no Bigram is found, the most frequent single word (Unigram) is returned.
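
The back-off steps above can be sketched as a simple table lookup. The code below is an illustrative Python sketch under stated assumptions, not the author's R implementation: `build_tables` and `predict_next` are hypothetical names, the toy corpus is invented, and candidates are ranked by raw count without any discounting.

```python
from collections import Counter

def build_tables(sentences, max_n=4):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                tables[n][tuple(words[i:i + n])] += 1
    return tables

def predict_next(tables, text, k=3, max_n=4):
    """Back off from the longest usable history down to the unigram table."""
    words = text.lower().split()
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # not enough typed words for an n-gram of this order
        history = tuple(words[-(n - 1):])
        # Candidates: last word of every n-gram whose first n-1 words match.
        # (A linear scan keeps the sketch short; a real app would index by history.)
        matches = Counter({gram[-1]: c for gram, c in tables[n].items()
                           if gram[:-1] == history})
        if matches:
            return [w for w, _ in matches.most_common(k)]
    # No higher-order match: return the most frequent single words.
    return [gram[0] for gram, _ in tables[1].most_common(k)]

# Hypothetical toy corpus.
corpus = ["the cat sat on the mat",
          "the cat sat on the sofa",
          "the dog sat on the mat"]
tables = build_tables(corpus)
quad_hit = predict_next(tables, "sat on the")      # answered by the Quadgram table
tri_hit = predict_next(tables, "a bird on the")    # no Quadgram, Trigram matches
uni_hit = predict_next(tables, "zebra")            # falls through to Unigrams
```

Returning up to `k` candidates mirrors the app's behaviour of showing the top three predictions.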

Kneser-Ney algorithm:

Create a unigram model that serves as the continuation probability.
Estimate this model from the number of bigram types each word completes.
Count the highest-order N-gram being interpolated.
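
To make the continuation idea concrete, here is a minimal Python sketch of interpolated Kneser-Ney restricted to bigrams. It is an illustration under assumptions, not the project's implementation: the function name `kneser_ney_bigram`, the fixed discount of 0.75 and the toy counts are all hypothetical choices.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(bigram_counts, discount=0.75):
    """Return prob(word, prev) using interpolated Kneser-Ney on bigrams."""
    history_total = Counter()        # total count of bigrams starting with prev
    followers = defaultdict(set)     # distinct words seen after prev
    preceders = defaultdict(set)     # distinct words seen before word
    for (prev, word), c in bigram_counts.items():
        history_total[prev] += c
        followers[prev].add(word)
        preceders[word].add(prev)
    n_types = len(bigram_counts)     # number of distinct bigram types

    def p_cont(word):
        # Continuation probability: share of bigram types that this word completes.
        return len(preceders.get(word, ())) / n_types

    def prob(word, prev):
        if history_total[prev] == 0:
            return p_cont(word)      # unseen history: continuation model only
        c = bigram_counts.get((prev, word), 0)
        discounted = max(c - discount, 0) / history_total[prev]
        # Mass removed by discounting is redistributed via the continuation model.
        lam = discount * len(followers[prev]) / history_total[prev]
        return discounted + lam * p_cont(word)

    return prob

# Hypothetical counts: "francisco" is frequent but follows only "san",
# while "car" is rarer overall but follows three different words.
counts = Counter({("san", "francisco"): 5, ("new", "york"): 3,
                  ("new", "car"): 1, ("red", "car"): 1,
                  ("my", "car"): 1, ("in", "town"): 1})
p = kneser_ney_bigram(counts)
p_car = p("car", "in")
p_francisco = p("francisco", "in")
total = sum(p(w, "in") for w in ["francisco", "york", "car", "town"])
```

With these counts, `p_car` comes out larger than `p_francisco` even though "francisco" occurs more often, because "car" completes more bigram types. The probabilities over the four candidate words sum to 1 for the history "in".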

Shiny Application

A Shiny app was created. The app has a textInput box that allows users to type in a partial sentence, and a verbatimTextOutput box that presents the top three predicted words when the “next word” action button is clicked.

Below is a screenshot of the Shiny app interface:

[Screenshot of the Shiny app interface]