NLP: Próxima Palavra

agrou
June 7th, 2017

Problem

Aim: Build an app that predicts the next word from the user's input

    hello how are ...
    how are ...
    are ...

Do frequent words have a higher probability of being the next word?
How many parameters should we consider?
How do we adjust for the context in which the word appears?

Model

Probability-based algorithm based on the n-gram model

Uses the preceding words, from 5 down to 1, to predict the next word.

Score based on the Stupid Backoff weight of each prediction¹

n-gram	Score
hello how are you	1
how are you	0.4
are you	0.16
you	0.02

Returns a list ordered by score: the first word has the highest score
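The scoring described above can be sketched in a few lines of Python. This is a minimal illustration of Stupid Backoff as defined by Brants et al., not the app's actual code: the toy corpus, the function names `count_ngrams` and `predict`, and the backoff factor of 0.4 (the value suggested in the paper) are all assumptions made for the example. Each time the context is shortened by one word, the weight is multiplied by 0.4, which gives the 1, 0.4, 0.16 progression seen in the table.

```python
from collections import Counter

ALPHA = 0.4       # assumed backoff factor, as suggested by Brants et al. (2007)
MAX_CONTEXT = 5   # the deck uses up to 5 preceding words


def count_ngrams(tokens, max_order=MAX_CONTEXT + 1):
    """Count every n-gram of order 1..max_order in a token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def predict(context, counts, total_tokens, top_n=3):
    """Rank candidate next words by Stupid Backoff score, best first."""
    context = tuple(context[-MAX_CONTEXT:])
    scores = {}
    weight = 1.0
    # Try the longest context first, then drop the leftmost word each round.
    for start in range(len(context) + 1):
        ctx = context[start:]
        # The empty context falls back to plain unigram frequency.
        denom = counts.get(ctx, 0) if ctx else total_tokens
        if denom:
            # Linear scan for clarity; a real app would index by context.
            for ngram, c in counts.items():
                if len(ngram) == len(ctx) + 1 and ngram[:-1] == ctx:
                    # Keep only the score from the longest matching context.
                    scores.setdefault(ngram[-1], weight * c / denom)
        weight *= ALPHA
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy corpus, invented for this example.
tokens = "hello how are you hello how are you how are they".split()
counts = count_ngrams(tokens)
ranked = predict("hello how are".split(), counts, len(tokens))
print(ranked)
```

In this toy corpus "you" always follows "hello how are", so it ranks first with a score of 1; "they" is only found after backing off to "how are", so its relative frequency is discounted by 0.4.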

[1]: Brants, T., Popat, A. C., Xu, P., Och, F. J., and Dean, J. (2007). Large language models in machine translation. In EMNLP/CoNLL 2007.

Performance

Corpus sample size	App response time (seconds)	Training accuracy	Testing accuracy
10%	0.02	40%	30%

Accuracy: Measured on a 99% (training) and a 1% (testing) split of the sample corpus
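One way such a measurement could be set up is sketched below. The deck does not say how a prediction is counted as correct, so this example assumes top-1 accuracy (the first suggested word must equal the actual next word); `split_corpus`, `top1_accuracy`, and the stand-in predictor `always_you` are hypothetical names introduced only for illustration.

```python
import random


def split_corpus(sentences, test_fraction=0.01, seed=42):
    """Shuffle sentences and hold out a small fraction for testing."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def top1_accuracy(sentences, predict_next):
    """Share of positions where the first prediction is the actual next word."""
    hits = total = 0
    for sentence in sentences:
        words = sentence.split()
        # Predict each word from the words that precede it.
        for i in range(1, len(words)):
            total += 1
            if predict_next(words[:i]) == words[i]:
                hits += 1
    return hits / total if total else 0.0


def always_you(context):
    """Stand-in predictor; a trained n-gram model would go here."""
    return "you"


# Toy data, invented for this example.
sentences = ["how are you", "how are they", "are you there"] * 100
train, test = split_corpus(sentences)
print(len(train), len(test))
print(top1_accuracy(test, always_you))
```

Evaluating the same predictor on `train` and on `test` gives the two accuracy columns; a large gap between them suggests the model memorizes the sample rather than generalizing.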

Accuracy is low compared with the latest SwiftKey dashboard developments.

Future developments: The ideal algorithm would use a bigger sample and learn from the user's input. This could require more memory than a Shiny server can comfortably handle.

Próxima Palavra App

Check it out here: ProximaPalavra

Features

User experience:

Type something: Type or paste some text
No. of word predictions: Choose the number of words in the output
Show plot: Graphical visualization of the next-word predictions

Server answer:

Reads the last 5 input words and checks them against a text corpus
Next word: The number of words returned is chosen by the user, but their order is determined by the model.
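The request handling described above might look roughly like the following sketch. The app's real preprocessing is not described in the deck, so the cleaning rule (lowercase, English letters and apostrophes only) and the helper names `last_words`, `backoff_contexts`, and `take_top` are assumptions made for illustration.

```python
import re

MAX_CONTEXT = 5  # the server reads the last 5 input words


def last_words(text, k=MAX_CONTEXT):
    """Lowercase the input, keep word-like tokens, return the last k."""
    # Assumed cleaning rule: English letters and apostrophes only.
    tokens = re.findall(r"[a-z']+", text.lower())
    return tokens[-k:]


def backoff_contexts(words):
    """List the contexts to look up in the corpus, longest first."""
    return [words[i:] for i in range(len(words))]


def take_top(ranked, n):
    """Return the first n words from a list already ordered by the model."""
    return [word for word, _score in ranked[:n]]


print(last_words("Hello, how are you doing today my friend"))
print(backoff_contexts(["hello", "how", "are"]))
print(take_top([("you", 1.0), ("they", 0.13)], 1))
```

The user controls only `n`; the ordering of `ranked` comes entirely from the model's scores.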

Thank you

Coursera's Data Science Specialization community:

Professors, Mentors and students!