Coursera Data Science Capstone: Word Prediction App

Matt Dancho
2016-12-27

Predicting the next word in a phrase…

Shiny Web Prediction App

Predict the Next Word

  • Goals: Develop and deploy a word prediction algorithm to the web
  • Results: Shiny Web App with fast yet accurate prediction capabilities

Data Science Workflow

Model Implementation

  • Raw text from news feeds, blogs, and Twitter tweets was cleaned and tokenized into n-grams, the word sequences used to calculate probabilities. Parallel processing and specialized R text-mining packages were used: tm, RWeka, and multidplyr.
  • Two models were developed:
1. Simple n-gram backoff that picks the highest-frequency match: lower accuracy, but no internal computation.

2. Stupid backoff with n-gram scoring: better accuracy, but requires internal computation.

Model 1 was selected as the best combination of accuracy and speed.
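
The two approaches can be sketched in base R as follows. This is a minimal illustration, not the app's actual code: the toy corpus, the function names (`tokenize`, `count_ngrams`, `predict_backoff`, `stupid_backoff_score`), and the 0.4 backoff factor (the value conventionally used with stupid backoff) are all assumptions made for the example.

```r
# Toy corpus (hypothetical; the original used news, blog, and Twitter text).
corpus <- c("i love new york", "i love new shoes",
            "i love new york city", "we love new york")

# Split a line into lowercase word tokens.
tokenize <- function(x) strsplit(tolower(trimws(x)), "\\s+")[[1]]

# Count every n-gram of order n; returns a named integer vector.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))
}

# counts[[n]] holds the frequencies of all n-grams, for n = 1..4.
counts <- lapply(1:4, function(n) count_ngrams(corpus, n))

# Model 1 (sketch): try the longest matching context first, back off to
# shorter ones, and return the last word of the most frequent match.
predict_backoff <- function(phrase, counts) {
  w <- tokenize(phrase)
  if (length(w) > 0) {
    for (n in seq(min(length(counts), length(w) + 1), 2, by = -1)) {
      prefix <- paste0(paste(tail(w, n - 1), collapse = " "), " ")
      tab <- counts[[n]]
      hits <- tab[startsWith(as.character(names(tab)), prefix)]
      if (length(hits) > 0) {
        return(tail(tokenize(names(hits)[which.max(hits)]), 1))
      }
    }
  }
  # No context matched: fall back to the most frequent single word.
  names(counts[[1]])[which.max(counts[[1]])]
}

# Model 2 (sketch): stupid backoff score for one candidate word.
# Relative frequency at the highest order that matches, multiplied by
# alpha once for every order that had to be skipped.
stupid_backoff_score <- function(word, context, counts, alpha = 0.4) {
  ctx <- tail(tokenize(context), length(counts) - 1)
  penalty <- 1
  while (length(ctx) > 0) {
    n <- length(ctx) + 1
    num <- counts[[n]][paste(c(ctx, word), collapse = " ")]
    if (!is.na(num)) {
      den <- counts[[n - 1]][paste(ctx, collapse = " ")]
      return(penalty * unname(num) / unname(den))
    }
    ctx <- ctx[-1]               # drop the oldest context word
    penalty <- penalty * alpha
  }
  uni <- counts[[1]][word]
  if (is.na(uni)) return(0)      # word never seen in the corpus
  penalty * unname(uni) / sum(counts[[1]])
}
```

With this toy corpus, `predict_backoff("i love new", counts)` returns `"york"`, and `stupid_backoff_score("york", "i love new", counts)` returns 2/3. The trade-off named in the list is visible here: Model 1 only looks up the most frequent match, whereas Model 2 has to compute a score for each candidate word before ranking them.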

Final Results

A total of 1,000 randomly sampled n-grams from a holdout set were tested. The final model analyzed all 1,000 samples in about 11 seconds (elapsed time), with an overall accuracy of 12.8%. Accuracy rose with n-gram order: 9.3% on 2-grams, 13.8% on 3-grams, and 15.3% on 4-grams.

   user  system elapsed 
   0.72    0.13   10.89 
# A tibble: 3 × 4
      n samples correct   acc
  <dbl>   <int>   <dbl> <dbl>
1     2     323      30   9.3
2     3     369      51  13.8
3     4     308      47  15.3
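
The reported figures can be checked from the counts in the table above. The short base R sketch below is only an arithmetic check; the variable names (`results`, `overall`, `ms_per_sample`) are assumptions for illustration, and the per-sample time is derived from the elapsed time rather than stated in the original.

```r
# Sample and correct counts copied from the results table.
results <- data.frame(
  n       = c(2, 3, 4),
  samples = c(323, 369, 308),
  correct = c(30, 51, 47)
)

# Accuracy per n-gram order, in percent, rounded to one decimal place.
results$acc <- round(100 * results$correct / results$samples, 1)

# Overall accuracy across all 1,000 samples, in percent.
overall <- 100 * sum(results$correct) / sum(results$samples)

# Average elapsed time per prediction, in milliseconds (10.89 s in total).
ms_per_sample <- 1000 * 10.89 / sum(results$samples)
```

The per-order values reproduce the `acc` column (9.3, 13.8, 15.3), the overall accuracy is 128 / 1000 = 12.8%, and the elapsed time works out to roughly 11 ms per prediction.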

How to Use

  • Enter a word or phrase in the prediction field

  • Watch as the top predictions appear, ranked by n-gram frequency

  • Try the Word Prediction App
