Coursera Data Science Specialization - Capstone project

Jin X.
2018-05-06

Outline of the presentation for the capstone project

  • The objective
  • The implementation of the word prediction model
  • The usage of the application
  • Performance

For more details about the capstone project, please visit https://www.coursera.org/learn/data-science-project/home/welcome

The Objective of the Project

  • Clean and tokenize text data from the corpora
  • Create an n-gram model from the cleaned dataset
  • Build a predictive model using a backoff n-gram model
  • Build a Shiny application to show the predictions
  • Deploy and present the Shiny application

The Implementation of the Word Prediction Model

  • Use the quanteda package to tokenize the sampled data (a sketch follows this list)
  • Remove numbers, punctuation, symbols, separators, URLs, and hyphens from the corpus
  • Build a term-frequency dictionary using data.table
  • Prune the dictionary by discarding low-frequency terms, and create the n-gram data tables up to 5-grams
  • Search for the five most frequent word candidates, backing off from 5-grams down to unigrams
  • Use the stupid backoff scheme to rank next-word candidates (see the scoring sketch below)
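
A minimal sketch of the tokenization and cleaning step, assuming the sampled corpus is held in a character vector txt (the object name is illustrative):

    library(quanteda)

    # Tokenize the sampled text, dropping the same token classes that are
    # removed in the cleaning step above
    toks <- tokens(txt,
                   remove_numbers    = TRUE,
                   remove_punct      = TRUE,
                   remove_symbols    = TRUE,
                   remove_separators = TRUE,
                   remove_url        = TRUE,
                   split_hyphens     = TRUE)  # remove_hyphens = TRUE in quanteda 1.x
    toks <- tokens_tolower(toks)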
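
The term-frequency tables can then be assembled with data.table; a sketch for the 5-gram table, with an illustrative pruning threshold (the lower-order tables follow the same pattern):

    library(data.table)

    # Count 5-grams, prune rare ones, and split each into its context
    # (first four words) and the word to be predicted (last word)
    ng5 <- data.table(ngram = unlist(tokens_ngrams(toks, n = 5, concatenator = " ")))
    ng5 <- ng5[, .(count = .N), by = ngram]
    ng5 <- ng5[count > 1]                       # discard low-frequency terms (threshold illustrative)
    ng5[, context := sub(" \\S+$", "", ngram)]  # drop the last word
    ng5[, word    := sub("^.* ",   "", ngram)]  # keep only the last word
    setkey(ng5, context)                        # keyed lookup for fast search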
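
Stupid backoff scores a candidate w against a context as count(context + w) / count(context) when the full n-gram was observed, and otherwise backs off to lambda * S(w | shorter context); the app defaults to lambda = 0.3 (Brants et al., 2007, used 0.4). A sketch of the ranking step, assuming keyed tables ng5 through ng2 built as above, with the unigram fallback omitted for brevity:

    # Rank next-word candidates from the 5-gram table down to the bigram
    # table, keeping the best score seen for each candidate word
    predict_next <- function(words, tables = list(ng5, ng4, ng3, ng2),
                             lambda = 0.3) {
      scores <- rbindlist(lapply(seq_along(tables), function(i) {
        n   <- length(tables) - i + 2           # order of this table: 5 down to 2
        ctx <- paste(tail(words, n - 1), collapse = " ")
        tables[[i]][.(ctx)][!is.na(count),
          .(word, score = lambda^(i - 1) * count / sum(count))]
      }))
      head(scores[, .(score = max(score)), by = word][order(-score)], 5)
    }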

The Usage of the Shiny Application

Click my app to test it.

1. Pick a word from the suggestions, or type the word(s)

2. Press the Go button

3. The input word(s) will be appended to the input sentence, and the suggestions will be updated with new predictions

4. Go back to step 1 to continue the input

Other functionalities:

  • Use the Clear button to start a new sentence

  • Prediction statistics are shown in the right corner. To start a new evaluation, press the Reset counter button

  • Choose the stupid backoff coefficient with the slider in the left corner (optional; the default is 0.3)

Performance of the Word Prediction Model

The performance of the prediction model was evaluated by running benchmark.R from https://github.com/hfoffani/dsci-benchmark, as sketched below.
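
A sketch of how the model might be wired into the harness, assuming benchmark.R and its test sets (blogs, tweets) are loaded per that repository's README and that predict_next() is the ranking function sketched earlier; consult the README for the exact benchmark() signature:

    # Wire the prediction model into the benchmark harness
    source('benchmark.R')                       # from hfoffani/dsci-benchmark

    # Wrapper: the harness passes the preceding text and expects a
    # character vector of top-3 next-word candidates (assumed convention)
    predict_top3 <- function(prev, ...) {
      words <- unlist(strsplit(tolower(prev), "\\s+"))
      head(predict_next(words)$word, 3)
    }

    benchmark(predict_top3,
              sent.list  = list('blogs'  = blogs,
                                'tweets' = tweets),
              ext.output = TRUE)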

Overall top-3 score: 18.86 %

Overall top-1 precision: 13.96 %

Overall top-3 precision: 23.02 %

Average runtime: 9.74 msec

Number of predictions: 28464

Total memory used: 140.07 MB

Dataset details

Dataset “blogs” (599 lines, 14587 words, hash 14b3c593e543eb8b2932cf00b646ed653e336897a03c82098b725e6e1f9b7aa2) Score: 18.13 %, Top-1 precision: 13.04 %, Top-3 precision: 22.46 %

Dataset “tweets” (793 lines, 14071 words, hash 7fa3bf921c393fe7009bc60971b2bb8396414e7602bb4f409bed78c7192c30f4) Score: 19.60 %, Top-1 precision: 14.88 %, Top-3 precision: 23.57 %