Data Science Capstone Project

Geons
March 2017

  • This presentation introduces an application that uses Natural Language Processing techniques to predict the next word.
  • The Shiny application can be viewed at this link.
  • The Data Science Specialisation is jointly organised by Coursera and Johns Hopkins University.
  • Data for this Capstone project is kindly offered by SwiftKey.

OBJECTIVE

  • The goal was to develop a Shiny app that can predict the next word, not unlike the keyboard apps we use daily on our phones, implemented by companies like SwiftKey.

  • This app is meant to showcase what was learnt during the 9 courses of the Data Science specialisation.

  • The initial tasks like obtaining and cleaning data are documented in a Milestone Report. The creation of N-grams is also described there.

METHODS AND APPROACH

  • After preparation, the data was sampled and broken down (tokenized) into contiguous sequences of N words, the so-called N-grams; a minimal sketch of this step follows this list.

  • These N-grams were then analysed and form the basis of the predictive model. Considerable care was taken in cleaning and building them.

  • The predictive model uses the Katz back-off model. In essence this means that if the n-gram has been seen more than k times in training, the conditional probability of a word given its history is proportional to the maximum likelihood estimate of that n-gram. Otherwise, the conditional probability is equal to the back-off conditional probability of the “(n - 1)-gram” (Wikipedia). A simplified back-off sketch also follows this list.
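
The fragment below is a minimal, illustrative sketch of the tokenization step in base R: text is cleaned and then split into overlapping N-grams whose counts can be tabulated. The function names (clean_text, make_ngrams) and the tiny example corpus are assumptions for illustration, not the application's actual code.

    # Minimal sketch of N-gram creation in base R; names are illustrative.
    clean_text <- function(x) {
      x <- tolower(x)
      x <- gsub("[^a-z' ]", " ", x)   # drop punctuation and digits
      x <- gsub("\\s+", " ", x)       # collapse repeated whitespace
      trimws(x)
    }

    make_ngrams <- function(text, n = 2) {
      words <- unlist(strsplit(clean_text(text), " "))
      if (length(words) < n) return(character(0))
      sapply(seq_len(length(words) - n + 1),
             function(i) paste(words[i:(i + n - 1)], collapse = " "))
    }

    # Example: frequency table of bigrams from a tiny corpus sample
    sample_lines <- c("This is a small example", "this is another example")
    bigrams <- unlist(lapply(sample_lines, make_ngrams, n = 2))
    sort(table(bigrams), decreasing = TRUE)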
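
The next fragment illustrates only the back-off idea, reusing clean_text and the kind of count tables produced above: trigram counts are consulted first, then bigrams, then unigrams, with a threshold k. The function predict_next is hypothetical, and the sketch deliberately omits the Good-Turing discounting and normalisation that the full Katz model applies.

    # Simplified back-off lookup; trigrams/bigrams/unigrams are assumed to be
    # named count tables such as the one produced above. Illustration only.
    predict_next <- function(history, trigrams, bigrams, unigrams, k = 1) {
      words <- tail(unlist(strsplit(clean_text(history), " ")), 2)

      # Try trigrams that start with the last two words of the history
      if (length(words) == 2) {
        prefix <- paste(words, collapse = " ")
        hits <- trigrams[startsWith(names(trigrams), paste0(prefix, " "))]
        hits <- hits[hits > k]
        if (length(hits) > 0)
          return(sub(".* ", "", names(which.max(hits))))
      }

      # Back off to bigrams starting with the last word
      prefix <- tail(words, 1)
      hits <- bigrams[startsWith(names(bigrams), paste0(prefix, " "))]
      hits <- hits[hits > k]
      if (length(hits) > 0)
        return(sub(".* ", "", names(which.max(hits))))

      # Final fallback: the most frequent unigram
      names(which.max(unigrams))
    }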

THE SHINY APPLICATION

  • The Shiny application allows the prediction of the next possible word in a sentence.

  • When the application loads, the user is immediately presented with the input text box and a user guide alongside it.

  • After switching tabs, the user can see the most probable prediction and the runners-up.

  • In a third and final tab, more background information is given. A minimal sketch of this three-tab layout follows.
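
Below is a minimal sketch of a Shiny layout with the three tabs described above, wired to the illustrative predict_next function and n-gram tables from the earlier sketches; it is an assumption about how such an app could be structured, not the deployed application's source.

    library(shiny)

    # Minimal three-tab layout; predict_next and the n-gram tables are the
    # illustrative objects from the earlier sketches, assumed to be loaded.
    ui <- fluidPage(
      titlePanel("Next Word Prediction"),
      tabsetPanel(
        tabPanel("Input",
                 textInput("phrase", "Type a phrase:", value = ""),
                 helpText("Enter a few words, then open the Prediction tab.")),
        tabPanel("Prediction",
                 verbatimTextOutput("nextword")),
        tabPanel("About",
                 p("Background information on the model and the data."))
      )
    )

    server <- function(input, output) {
      output$nextword <- renderText({
        if (nchar(input$phrase) == 0) return("Waiting for input ...")
        predict_next(input$phrase, trigrams, bigrams, unigrams)
      })
    }

    shinyApp(ui = ui, server = server)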


ADDITIONAL COMMENTS AND LINKS