Coursera Data Science Specialization - Capstone project

Jin X.
2018-05-06

Outline of the presentation for the capstone project

  • The objective
  • The implementation of the word prediction model
  • The usage of the application
  • Performance

For more details about the capstone project, please visit https://www.coursera.org/learn/data-science-project/home/welcome

The Objective of the Project

  • Clean and tokenize text data from the corpora
  • Create an n-gram model from the cleaned dataset
  • Build a predictive model using a backoff n-gram model
  • Build a Shiny application to show the predictions
  • Deploy and present the Shiny application

The Implementation of the Word Prediction Model

  • Use the quanteda package to tokenize the sampled data (a sketch follows this list)
  • Remove numbers, punctuation, symbols, separators, URLs, and hyphens from the corpus
  • Build a term-frequency dictionary using data.table
  • Prune the dictionary by discarding low-frequency terms, and create the n-gram data tables up to 5-grams
  • Search for the five most frequent word candidates, backing off from 5-grams down to unigrams
  • Use the stupid backoff scheme to rank next-word candidates (see the scoring sketch below)
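
A minimal sketch of the tokenization and cleaning step, assuming the sampled corpus is held in a character vector txt (the object name is illustrative):

    library(quanteda)

    # Tokenize the sampled text, dropping the same token classes that are
    # removed in the cleaning step above
    toks <- tokens(txt,
                   remove_numbers    = TRUE,
                   remove_punct      = TRUE,
                   remove_symbols    = TRUE,
                   remove_separators = TRUE,
                   remove_url        = TRUE,
                   split_hyphens     = TRUE)  # remove_hyphens = TRUE in quanteda 1.x
    toks <- tokens_tolower(toks)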
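
The term-frequency tables can then be assembled with data.table; a sketch for the 5-gram table, with an illustrative pruning threshold (the lower-order tables follow the same pattern):

    library(data.table)

    # Count 5-grams, prune rare ones, and split each into its context
    # (first four words) and the word to be predicted (last word)
    ng5 <- data.table(ngram = unlist(tokens_ngrams(toks, n = 5, concatenator = " ")))
    ng5 <- ng5[, .(count = .N), by = ngram]
    ng5 <- ng5[count > 1]                       # discard low-frequency terms (threshold illustrative)
    ng5[, context := sub(" \\S+$", "", ngram)]  # drop the last word
    ng5[, word    := sub("^.* ",   "", ngram)]  # keep only the last word
    setkey(ng5, context)                        # keyed lookup for fast search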
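
Stupid backoff scores a candidate w against a context as count(context + w) / count(context) when the full n-gram was observed, and otherwise backs off to lambda * S(w | shorter context); the app defaults to lambda = 0.3 (Brants et al., 2007, used 0.4). A sketch of the ranking step, assuming keyed tables ng5 through ng2 built as above, with the unigram fallback omitted for brevity:

    # Rank next-word candidates from the 5-gram table down to the bigram
    # table, keeping the best score seen for each candidate word
    predict_next <- function(words, tables = list(ng5, ng4, ng3, ng2),
                             lambda = 0.3) {
      scores <- rbindlist(lapply(seq_along(tables), function(i) {
        n   <- length(tables) - i + 2           # order of this table: 5 down to 2
        ctx <- paste(tail(words, n - 1), collapse = " ")
        tables[[i]][.(ctx)][!is.na(count),
          .(word, score = lambda^(i - 1) * count / sum(count))]
      }))
      head(scores[, .(score = max(score)), by = word][order(-score)], 5)
    }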

The Usage of the Shiny Application

Click my app to test it.

1. Pick a word from the suggestions, or type the word(s)

2. Press the Go button

3. The input word(s) will be appended to the input sentence, and the suggestions will be updated with new predictions

4. Go back to step 1 to continue the input

Other functionalities:

  • Use the Clear button to start a new sentence

  • Prediction statistics are shown in the right corner. To start a new evaluation, press the Reset counter button

  • Choose the stupid backoff coefficient with the slider in the left corner (optional; the default is 0.3)

Performance of the Word Prediction Model

The performance of the prediction model was evaluated by running benchmark.R from https://github.com/hfoffani/dsci-benchmark, as sketched below.
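
A sketch of how the model might be wired into the harness, assuming benchmark.R and its test sets (blogs, tweets) are loaded per that repository's README and that predict_next() is the ranking function sketched earlier; consult the README for the exact benchmark() signature:

    # Wire the prediction model into the benchmark harness
    source('benchmark.R')                       # from hfoffani/dsci-benchmark

    # Wrapper: the harness passes the preceding text and expects a
    # character vector of top-3 next-word candidates (assumed convention)
    predict_top3 <- function(prev, ...) {
      words <- unlist(strsplit(tolower(prev), "\\s+"))
      head(predict_next(words)$word, 3)
    }

    benchmark(predict_top3,
              sent.list  = list('blogs'  = blogs,
                                'tweets' = tweets),
              ext.output = TRUE)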

Overall top-3 score: 18.86 %

Overall top-1 precision: 13.96 %

Overall top-3 precision: 23.02 %

Average runtime: 9.74 msec

Number of predictions: 28464

Total memory used: 140.07 MB

Dataset details

Dataset “blogs” (599 lines, 14587 words, hash 14b3c593e543eb8b2932cf00b646ed653e336897a03c82098b725e6e1f9b7aa2) Score: 18.13 %, Top-1 precision: 13.04 %, Top-3 precision: 22.46 %

Dataset “tweets” (793 lines, 14071 words, hash 7fa3bf921c393fe7009bc60971b2bb8396414e7602bb4f409bed78c7192c30f4) Score: 19.60 %, Top-1 precision: 14.88 %, Top-3 precision: 23.57 %