Data Science Capstone Coursera

Sofian Hamiti

Overview

The aim of the project is to develop a Shiny app that predicts the next word as a user types a sentence.

This feature is particularly useful for keyboard applications such as SwiftKey.

The prediction model is an n-gram model built from three very large corpora: Twitter, News, and Blogs.
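
As a rough illustration of what building an n-gram model from text involves, the sketch below counts word n-grams over a list of lines. It is written in Python purely for illustration; the function name `count_ngrams` and the lowercase, whitespace-based tokenization are assumptions, not the app's actual preprocessing.

```python
from collections import Counter


def count_ngrams(lines, n):
    """Count every run of n consecutive words across the given lines."""
    counts = Counter()
    for line in lines:
        # Assumption: lowercase and split on whitespace; a real pipeline
        # would likely also strip punctuation, numbers, and profanity.
        tokens = line.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


bigrams = count_ngrams(["the cat sat", "the cat ran"], 2)
print(bigrams[("the", "cat")])
```

The resulting table maps each n-gram to its frequency, which is the raw material any count-based next-word predictor works from.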

Due to computational limitations, the model is trained on a subset of the data rather than on the full corpora.
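
One common way to take such a subset is to keep each line with a fixed probability. The Python sketch below shows the idea; the name `sample_lines`, the `fraction` parameter, and the seed are illustrative assumptions, since the source does not say how its subset was drawn.

```python
import random


def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`."""
    # A dedicated, seeded generator makes the sample reproducible.
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


corpus = [f"line {i}" for i in range(1000)]
subset = sample_lines(corpus, 0.1)
print(len(subset), "of", len(corpus), "lines kept")
```

Fixing the seed means the same subset is drawn on every run, which keeps training results comparable between experiments.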

The corpora were provided by HC Corpora: http://www.corpora.heliohost.org/aboutcorpus.html

The Model

The app implements a Naive Bayes model, which has the following advantages:

  • Simplicity
  • Can often outperform more sophisticated methods

Reference: http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
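
The source does not show how the Naive Bayes model is set up for next-word prediction. One possible reading, assumed here, treats each candidate next word as a class and the preceding words as features that are independent given that class. The Python sketch below follows that reading with add-one smoothing; the function names, the two-word context, and the toy sentences are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def train(sentences, context_size=2):
    """Collect class counts and per-class feature counts."""
    prior = Counter()                 # how often each word is the target
    likelihood = defaultdict(Counter) # target -> (offset, context word) counts
    vocab = set()
    for sentence in sentences:
        tokens = sentence.lower().split()
        vocab.update(tokens)
        for i in range(context_size, len(tokens)):
            target = tokens[i]
            prior[target] += 1
            for offset in range(1, context_size + 1):
                likelihood[target][(offset, tokens[i - offset])] += 1
    return prior, likelihood, vocab


def predict(text, prior, likelihood, vocab, context_size=2):
    """Return the target word with the highest Naive Bayes log score."""
    tokens = text.lower().split()
    # Feature (offset, word): `word` appears `offset` positions before the target.
    features = [(offset, tokens[-offset])
                for offset in range(1, min(context_size, len(tokens)) + 1)]
    total = sum(prior.values())

    def score(target):
        log_p = math.log(prior[target] / total)
        for feature in features:
            # Add-one smoothing so unseen features do not zero out the score.
            count = likelihood[target][feature] + 1
            log_p += math.log(count / (prior[target] + len(vocab)))
        return log_p

    return max(prior, key=score)


sentences = ["i love new york", "i love new ideas", "we love new york"]
model = train(sentences)
print(predict("love new", *model))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.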

How to use the app

  • In the left panel, type the sentence whose next word you want predicted.
  • The first request will take a few seconds while the app loads the n-gram file.
  • The app will display the predicted word in the right panel.
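
The delay on the first request is typical of lazy loading, where an expensive resource is read only when first needed and then reused. The Python sketch below illustrates that pattern in general terms; the class `LazyNgrams` and its `loader` argument are hypothetical and do not describe how the app itself loads its file.

```python
class LazyNgrams:
    """Load an n-gram table on first use, then serve it from memory."""

    def __init__(self, loader):
        self._loader = loader   # hypothetical callable that reads the n-gram file
        self._table = None
        self._loaded = False

    def get(self):
        if not self._loaded:
            # Only the first call pays the loading cost.
            self._table = self._loader()
            self._loaded = True
        return self._table


def slow_loader():
    print("loading n-gram table...")
    return {("new",): {"york": 2, "ideas": 1}}


ngrams = LazyNgrams(slow_loader)
ngrams.get()  # triggers the load
ngrams.get()  # reuses the cached table
```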

Availability

Link: https://sofianhamiti.shinyapps.io/dscc

Future improvements: The following improvements are planned for this app:

  • An option to choose different prediction models, such as Kneser-Ney smoothing and back-off
  • A table of the top 5 predictions
  • Better performance and accuracy
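
To give a sense of what a back-off model with a top-5 table could look like, the Python sketch below tries the longest available context first and falls back to shorter ones when no match is found. This is a plain count-based back-off without discounting, not Kneser-Ney, and the names `build_tables` and `top_predictions` are illustrative assumptions rather than planned app code.

```python
from collections import Counter, defaultdict


def build_tables(sentences, max_n=3):
    """For each order n, map an (n-1)-word context to next-word counts."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables


def top_predictions(text, tables, k=5):
    """Return up to k candidates, backing off from long to short contexts."""
    tokens = text.lower().split()
    for n in sorted(tables, reverse=True):
        if len(tokens) < n - 1:
            continue  # not enough typed words for this order
        context = tuple(tokens[len(tokens) - (n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            return [word for word, _ in candidates.most_common(k)]
    return []


sentences = ["i love new york", "i love new ideas", "we love new york"]
tables = build_tables(sentences)
print(top_predictions("love new", tables))   # trigram context matches
print(top_predictions("brand new", tables))  # falls back to the bigram context
```

Because the unigram table uses an empty context, the function still returns the most frequent words overall when nothing the user typed has been seen before.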