Next-word forecast model, or the first step to Westworld

Mikalai Dudko
2017.01.01

Idea

In the recent series 'Westworld' there is a scene where a robot sees its own speech program and tries to speak outside the program's framework. Wouldn't it be fun to emulate this program?

With limited resources and knowledge, it was decided to concentrate on two points:

  • response speed is a high priority: the next word should be forecast in real time
  • precision is a lower priority: the forecast is given as a suggestion of several words

Data

First, the raw datasets were sampled. Because of limited computing capacity, 30% of the Twitter dataset was taken (~700k records), 10% of Blogs (~90k) and 5% of News (4k). The sample is weighted towards Twitter to capture how people actually speak, rather than how rich the English vocabulary is (News).
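The sampling step can be sketched as below. This is only an illustration in Python, not the original code: the function name `sample_lines`, the fixed seed and the toy corpus are assumptions, and only the three fractions come from the text.

```python
import random

# Sampling fractions quoted in the text; the keys are just labels.
FRACTIONS = {"twitter": 0.30, "blogs": 0.10, "news": 0.05}

def sample_lines(lines, fraction, seed=0):
    """Return a random subset holding about `fraction` of the records."""
    rng = random.Random(seed)          # fixed seed keeps the sample repeatable
    k = round(len(lines) * fraction)   # number of records to keep
    return rng.sample(lines, k)        # sampling without replacement

# Toy stand-in for a real corpus file (hypothetical data).
toy_corpus = [f"record {i}" for i in range(1000)]
twitter_subset = sample_lines(toy_corpus, FRACTIONS["twitter"])
news_subset = sample_lines(toy_corpus, FRACTIONS["news"])
```

Sampling without replacement keeps every record at most once, so the subset sizes depend only on the fraction and the corpus size.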

The data was processed and 5-grams (sequences of five words) were created. Some details:

  • stop words (those with offensive or no meaning) and punctuation were removed,
  • some words occurring in different forms were unified to one form,
  • 5-grams that included words from different sentences were removed,
  • a phrase table with 5-gram usage frequencies was created,
  • the phrase table was not tokenized (split into separate words)
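A minimal sketch of how such a phrase table could be built, written in Python for illustration. The stop-word list, the regular expressions and the function names are assumptions, and the unification of word forms is left out. Each 5-gram is stored as one plain string, and 5-grams never cross a sentence boundary because they are counted sentence by sentence.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the list used in the project is not given.
STOP_WORDS = {"the", "a", "an"}

def clean_sentence(sentence):
    """Lowercase, drop punctuation and stop words, return the remaining words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_phrase_table(text, n=5):
    """Count n-grams inside each sentence; return (ngram, count) pairs, most frequent first."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text):
        words = clean_sentence(sentence)
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1   # stored as one untokenized string
    return counts.most_common()

text = "I want to go to the gym today. I want to go to the park."
table = build_phrase_table(text)
```

With this toy text, the 5-gram "i want to go to" occurs in both sentences and therefore ends up at the top of the table.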

Search algorithm

  1. The input phrase is processed the same way as the data, and then tokenized.
  2. The whole phrase table is searched for the last word of the input. Because the table is not tokenized, this is a plain string search, which speeds up processing significantly.
  3. The filtered phrase table is then tokenized, and the 5-grams that hold the last input word in the second-to-last position (last-1) are selected. If any of them also contain the second-to-last word of the input, the last words of those 5-grams are saved to the output vector. If not, the last words of all selected 5-grams are saved instead. (Remember that the phrase table is sorted by 5-gram frequency by default, so the most frequent continuations come first.)
  4. If the output does not hold enough choices, the procedure is repeated for the second-to-last word of the input.
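The steps above can be sketched roughly as follows. This is one possible reading in Python, not the original implementation: the function `suggest`, the parameter `k`, and the toy phrase table are hypothetical, and the sketch assumes that the second-to-last input word must sit directly before the last one in the 5-gram.

```python
def suggest(phrase_table, input_words, k=3):
    """Return up to k next-word suggestions.

    phrase_table: 5-grams as plain strings, most frequent first.
    input_words:  the input, already cleaned and tokenized (step 1).
    """
    out = []
    # Start from the last input word, then back off to the one before it (step 4).
    for back in (1, 2):
        if len(out) >= k or len(input_words) < back:
            break
        key = input_words[-back]
        prev = input_words[-back - 1] if len(input_words) > back else None
        # Step 2: cheap substring filter on the untokenized strings.
        candidates = [g for g in phrase_table if key in g]
        # Step 3: tokenize only the survivors; keep rows with `key` in position last-1.
        rows = [g.split() for g in candidates]
        rows = [r for r in rows if len(r) >= 2 and r[-2] == key]
        # Prefer rows that also match the preceding input word, else use all rows.
        strict = [r for r in rows if len(r) >= 3 and r[-3] == prev]
        for r in (strict or rows):
            if r[-1] not in out:
                out.append(r[-1])
    return out[:k]

# Toy phrase table (hypothetical), already sorted by frequency.
table = [
    "i want to go to",
    "want to go to gym",
    "want to go to park",
    "to go to gym today",
    "do you want to eat",
]
first = suggest(table, ["we", "want", "to"])
second = suggest(table, ["go", "to"])
```

The substring filter can let through false matches (for example "to" inside "today"), which is why the exact position check happens only after tokenizing the much smaller filtered table.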

User guide

  • Open the link https://midu.shinyapps.io/CapProject/
  • Either start typing a sentence or paste text into the input field (please note the limits: a minimum of 5 words and a maximum of 1400 characters)

The suggested words should appear straight away.