NLP: Próxima Palavra

agrou
June 7th, 2017

Problem

Aim: Build an app that predicts the next word based on the user's input

    hello how are ...
    how are ...
    are ...
  • Do frequent words have a higher probability of being the next word?

  • How many parameters should we consider?

  • How do we adjust for the context in which the word appears?

Model

Probability-based algorithm based on the n-gram model

Uses the 5 preceding words to predict the next word, backing off to shorter contexts down to 1 word.
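The shrinking inputs shown in the Problem slide can be sketched as a list of contexts, longest first. This is a minimal illustration in Python, not the app's code; the function name `backoff_contexts` and the whitespace tokenization are assumptions.

```python
def backoff_contexts(text, max_context=5):
    """Return the contexts to try, longest first.

    Only the last `max_context` words of the input are kept, then the
    oldest word is dropped one at a time.
    """
    words = text.lower().split()
    tail = words[-max_context:] if max_context > 0 else []
    return [tuple(tail[i:]) for i in range(len(tail))]


print(backoff_contexts("hello how are"))
# [('hello', 'how', 'are'), ('how', 'are'), ('are',)]
```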

Score based on a Stupid Backoff index, or weight of prediction [1]

n-gram               Score
hello how are you    1
how are you          0.4
are you              0.16
you                  0.02

Returns a list ordered by score: the first word has the highest score
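A minimal sketch of Stupid Backoff scoring as described in [1]: a candidate's score is its relative frequency after the longest matching context, multiplied by 0.4 for every backoff step. The toy corpus and the names `count_ngrams` and `predict` are assumptions for illustration, not the app's implementation.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor used by Brants et al. (2007)


def count_ngrams(tokens, max_order):
    """Count every n-gram of order 1..max_order in a token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def predict(counts, total_tokens, context, top_k=3, max_context=5):
    """Rank candidate next words, highest Stupid Backoff score first."""
    context = tuple(context)[-max_context:]
    vocab = sorted(g[0] for g in counts if len(g) == 1)
    scores = {}
    weight = 1.0
    while True:
        # The empty context falls back to plain unigram frequency.
        denom = counts[context] if context else total_tokens
        for word in vocab:
            seen = counts[context + (word,)]
            # Keep only the score from the longest context that matched.
            if denom and seen and word not in scores:
                scores[word] = weight * seen / denom
        if not context:
            break
        context = context[1:]  # drop the oldest word
        weight *= ALPHA        # penalise each backoff step
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


tokens = "how are you how are you how are things".split()
counts = count_ngrams(tokens, max_order=3)
print(predict(counts, len(tokens), ["how", "are"]))
```

In this toy corpus "you" follows "how are" twice out of three times, so it ranks first; words seen only as unigrams are reached after two backoff steps and carry a weight of 0.16.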

[1]: Brants, T., Popat, A. C., Xu, P., Och, F. J., and Dean, J. (2007). Large language models in machine translation. In EMNLP/CoNLL 2007.

Performance

Corpus sample size    App responsiveness (s)    Training accuracy    Testing accuracy
10%                   0.02                      40%                  30%

Accuracy: Measured on a 99% (training) and 1% (testing) split of the sample corpus
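One way such an accuracy figure could be computed is sketched below. It assumes the 99% / 1% figures describe a training / testing split and that a prediction counts as correct when the top-ranked word equals the word actually typed; the slides do not confirm either point. `split_corpus`, `top1_accuracy`, and `predict_next` are hypothetical names.

```python
import random


def split_corpus(lines, train_fraction=0.99, seed=42):
    """Shuffle the lines and split them into training and testing parts."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def top1_accuracy(predict_next, lines):
    """Share of positions where the top prediction matches the actual next word.

    `predict_next` is any callable that takes the list of preceding words
    and returns a single predicted word.
    """
    hits = total = 0
    for line in lines:
        words = line.lower().split()
        for i in range(1, len(words)):
            total += 1
            if predict_next(words[:i]) == words[i]:
                hits += 1
    return hits / total if total else 0.0
```

Evaluating the same predictor on both parts gives the training and testing accuracy columns; a gap between the two suggests the model has memorised part of the training text.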

Accuracy is low compared to the latest SwiftKey dashboard developments.

Future developments: The ideal algorithm would use a bigger sample size and learn from the user's input. This could require more memory than a Shiny server can handle.

Próxima Palavra App

Check it out here: ProximaPalavra

Features

User experience:

  1. Type something: Type or paste some text
  2. No. of word predictions: Choose the number of words in the output
  3. Show plot: Graphical visualization of the next-word prediction

Server answer:

  1. Reads the last 5 input words and checks them against a text corpus
  2. Next word: The number of words returned is chosen by the user, but their order is determined by the model.
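The two server steps can be sketched as a single function. This is a hedged illustration only: `server_answer` and `rank_candidates` are hypothetical names, the regular-expression tokenizer is an assumption, and the ranking model is passed in as a plain callable rather than being the app's own.

```python
import re


def server_answer(text, n_predictions, rank_candidates, max_context=5):
    """Return the top `n_predictions` words for typed or pasted text.

    `rank_candidates` is any callable that takes the context words and
    returns candidate words already ordered by the model.
    """
    # Lowercase the input and keep only alphabetic tokens and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    context = words[-max_context:]      # step 1: last 5 input words
    ranked = rank_candidates(context)   # the model decides the order
    return ranked[:n_predictions]       # step 2: the user decides how many
```

Keeping the cut-off outside the model means the same ranked list can serve any number of predictions the user asks for.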

Thank you

Coursera's Data Science Specialization community:

  • Professors, Mentors and students!