Data Science Capstone Final Report

Zhouyi Wu
October 2016

About the Report

This presentation introduces a Shiny app built as the project for the Coursera Data Science Capstone.

The goal of the app is to predict the next word of a given phrase.

For more details on the course and project, see https://www.coursera.org/learn/data-science-project/.

The final app: https://theoneeno.shinyapps.io/predict_word/

Data

The dataset is from HC Corpora: https://www.corpora.heliohost.org

We use three files from this dataset:

[1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

Because my PC has limited memory, I randomly selected 2,000 samples from each file to build the model.
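The report does not say how the random selection was done. One memory-friendly way to draw a fixed-size sample is reservoir sampling, which reads the input one item at a time and never holds more than the sample itself. The sketch below is in Python rather than the R used for the project, and it assumes that a "sample" is one line of a file; the function name, the seed, and the demo data are all hypothetical.

```python
import random

def reservoir_sample(lines, k=2000, seed=2016):
    """Return up to k items drawn uniformly at random from an iterable.

    Only k items are kept in memory, so the input can be a large file
    read line by line. The seed is an arbitrary illustrative value.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir

# Demo on synthetic data; with a real file one could pass the open
# file handle instead, e.g. reservoir_sample(open(path, encoding="utf-8")).
demo = reservoir_sample((f"line {i}" for i in range(10000)), k=5)
print(len(demo))
```

If the input has fewer than k items, the function simply returns all of them in their original order.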

N-gram and Prediction

The data analysis includes these steps:

  • Remove punctuation, numbers, and extra white space; stem words, convert them to lowercase, and remove stop words. Most of this work is done with the “tm” package.
  • Transform the corpus into three N-grams: unigram, bigram, and trigram. This step uses the “RWeka” package.
  • In the prediction step, we first clean the input phrase the same way as the dataset, then search for its last word in the bigrams and find the most frequent combination. If there is a tie in frequency, the frequency of the second word decides which one we choose.
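The steps above can be sketched in a few lines. This is a simplified illustration in Python, not the tm/RWeka code used for the app: stemming is left out, the stop-word list is a tiny made-up one, and the tie-break is assumed to use the unigram count of the candidate second word. All function names and the sample sentences are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "is", "in"}

def clean(text):
    """Lowercase, drop punctuation and digits, split on whitespace,
    and remove stop words. Stemming is omitted in this sketch."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation and numbers become spaces
    return [w for w in text.split() if w not in STOP_WORDS]

def build_ngrams(documents):
    """Count unigrams and bigrams over the cleaned documents."""
    unigrams, bigrams = Counter(), Counter()
    for doc in documents:
        tokens = clean(doc)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def predict_from_bigrams(phrase, unigrams, bigrams):
    """Return the most frequent follower of the phrase's last word,
    or None when the last word never starts a bigram."""
    tokens = clean(phrase)
    if not tokens:
        return None
    tail = tokens[-1]
    candidates = {w2: n for (w1, w2), n in bigrams.items() if w1 == tail}
    if not candidates:
        return None
    # Rank by bigram count first, then by the second word's unigram count.
    return max(candidates, key=lambda w: (candidates[w], unigrams[w]))

docs = ["I love data.", "We love music!", "Music rocks in 2016."]
unigrams, bigrams = build_ngrams(docs)
# "love data" and "love music" each occur once; "music" wins the tie
# because it is the more frequent word overall.
print(predict_from_bigrams("They really LOVE", unigrams, bigrams))  # music
```

Counting bigrams as pairs of adjacent tokens keeps the lookup simple: the prediction only needs the pairs whose first element is the last cleaned word of the input.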

We use only the unigram and bigram models for prediction: because the selected sample is small, the trigram counts are not statistically representative.

About the App

To use the app, simply type in the phrase and then click “submit”.

The app will give you the predicted next word.

If your text is not in the bigram model, the app returns the most frequent word in the unigram model.
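This fallback can be illustrated with a short Python sketch. It is not the app's actual code: the function name and the toy counts are hypothetical, the input is assumed to be already cleaned down to its last word, and ties are not handled here.

```python
from collections import Counter

def predict_with_fallback(last_word, unigrams, bigrams):
    """Use the bigram counts when the last word is known;
    otherwise fall back to the most frequent unigram."""
    followers = Counter(
        {w2: n for (w1, w2), n in bigrams.items() if w1 == last_word}
    )
    if followers:
        return followers.most_common(1)[0][0]
    if unigrams:
        return unigrams.most_common(1)[0][0]
    return None  # empty model: nothing to predict

# Toy counts, made up for the example.
unigrams = Counter({"love": 3, "music": 2, "data": 1})
bigrams = Counter({("love", "music"): 2, ("love", "data"): 1})

print(predict_with_fallback("love", unigrams, bigrams))    # music
print(predict_with_fallback("banana", unigrams, bigrams))  # love
```

With this design the app always returns some word as long as the model is not empty, at the cost of a generic answer for unseen input.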

Thank you for testing the app.