Mikalai Dudko
2017.01.01
In the recent TV series 'Westworld' there is a scene where a robot sees its own speech program and tries to speak outside the program's framework. Wouldn't it be fun to emulate this program?
With limited resources and knowledge, it was decided to concentrate on a few points:
First, the raw datasets were sampled. Due to limited computing capacity, only 30% of the Twitter dataset (~700k records), 10% of Blogs (~90k) and 5% of News (4k) were used. The sample is weighted towards Twitter on purpose: the goal is to capture how people actually speak (Twitter) rather than how rich the English vocabulary is (News).
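The sampling step can be sketched roughly as follows. This is only an illustration in Python, not the code used in the project: the function name `sample_lines`, the fixed seed and the toy corpora are assumptions, and only the 30% / 10% / 5% fractions come from the text above.

```python
import random

def sample_lines(lines, fraction, seed=0):
    """Keep each line independently with probability `fraction`.

    A fixed seed makes the sample reproducible between runs.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

# Hypothetical in-memory stand-ins for the three raw datasets.
raw = {
    "twitter": ["tweet %d" % i for i in range(1000)],
    "blogs": ["blog post %d" % i for i in range(1000)],
    "news": ["news item %d" % i for i in range(1000)],
}

# Sampling fractions taken from the text above.
fractions = {"twitter": 0.30, "blogs": 0.10, "news": 0.05}

sampled = {
    name: sample_lines(lines, fractions[name], seed=42)
    for name, lines in raw.items()
}

for name, lines in sampled.items():
    print(name, len(lines))
```

Because each line is kept or dropped independently, the sample size is only approximately the requested fraction, which is usually good enough for building a language model.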
The data was processed and 5-grams (five-word combinations) were created; some details:
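As a rough illustration of what creating 5-grams means, here is a minimal Python sketch. The tokenizer (lowercasing, keeping only letters and apostrophes) and the helper names `tokenize`, `ngrams` and `count_ngrams` are assumptions made for this example; the actual cleaning rules used in the project may differ.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n=5):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, n=5):
    """Count how often each n-gram occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(tokenize(line), n))
    return counts

# Toy input: two short lines that share one 5-gram.
lines = [
    "I would like to go home now",
    "I would like to go out",
]
counts = count_ngrams(lines)
print(counts.most_common(1))
```

N-grams are built per line here, so a 5-gram never spans two unrelated records; a line with fewer than five words simply contributes nothing.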
The suggested words must appear straight away.