El Grueff - A.S.
9. August 2019
The capstone of Coursera's Data Science Specialization asks you to build an app that predicts your next word, based on what you are writing.
Here you can find my version of this app. It looks basic, but it does the task.
Of course it could be made to look better, and it could be made a bit more useful (for instance with a button which lets you paste what you wrote plus the prediction), but that is something I'll tackle outside of this course.
The dataset we were given consisted of a whole lot of text from news articles, blogs and texts from all over the internet.
The algorithm takes a random sample of this text (10 percent in this case), tokenizes it (meaning it separates the words from each other), cleans the set (no profanities wanted) and creates so-called n-grams (chunks of n words which occur one after the other). The profanities are the well-known seven words you should never say on television. (I won't write them down here, just google them.)
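To make the preprocessing steps more concrete, here is a minimal Python sketch of sampling, tokenizing, cleaning and n-gram counting. It is not the code of the app itself: the function names, the regular expression and the placeholder blocklist are assumptions made for illustration only.

```python
import random
import re
from collections import Counter

# Placeholder blocklist (assumption); the real list holds actual profanities.
PROFANITIES = {"darn", "heck"}

def sample_lines(lines, fraction=0.10, seed=42):
    """Keep each line with probability `fraction` (10 percent here)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def tokenize(line):
    """Lowercase the line, split it into words and drop blocked words."""
    words = re.findall(r"[a-z']+", line.lower())
    return [w for w in words if w not in PROFANITIES]

def count_ngrams(lines, n):
    """Count every run of n consecutive words within each line."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Tiny hand-made corpus, only to show the call.
corpus = ["I like the heck out of green tea", "I like green apples"]
bigrams = count_ngrams(corpus, 2)
```

Note that in this sketch removing a blocked word joins its two neighbours into one n-gram. Dropping the whole line instead would be an equally reasonable choice.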
Given the last n-1 words you typed, the algorithm looks them up in the n-grams and returns the most probable next word. If nothing is found, it drops the oldest word and looks up the remaining n-2 words in the (n-1)-grams, and so on. If nothing is found even in the 2-grams, one of the 10 most used words in the sample is returned.
In this version of the app, n goes up to 4, so only the last three words are used; everything before them is ignored.
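The back-off lookup described above could be sketched in Python as follows. This is an illustration under assumptions, not the app's actual code: `build_model` and `predict` are hypothetical names, the tokenizer is a plain whitespace split, and the fallback simply picks the single most frequent word out of the top 10, since the text does not say which of the ten is chosen.

```python
from collections import Counter, defaultdict

MAX_N = 4  # longest n-gram used, as in this version of the app

def build_model(sentences, max_n=MAX_N):
    """Map each context (1 to max_n-1 words) to counts of the word that follows."""
    model = defaultdict(Counter)
    unigrams = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                model[context][words[i + n - 1]] += 1
    return model, unigrams

def predict(text, model, unigrams, max_n=MAX_N):
    """Try the longest context first, then back off to shorter ones."""
    words = text.lower().split()
    for k in range(min(max_n - 1, len(words)), 0, -1):
        context = tuple(words[-k:])
        if context in model:
            return model[context].most_common(1)[0][0]
    # Nothing matched, even in the 2-grams: fall back to a frequent word.
    top = unigrams.most_common(10)
    return top[0][0] if top else None

# Tiny hand-made corpus, only to show the call.
sentences = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat sat on the mat",
]
model, unigrams = build_model(sentences)
guess = predict("my dog sat on", model, unigrams)
```

In the usage lines, the three-word context "dog sat on" never occurs in the corpus, so the lookup backs off to "sat on" and finds its answer there.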
I didn't build a submit button or anything like that, because I thought about how I would use such an app: I would need it (basically like the SwiftKey keyboard) to give me predictions automatically and continuously, which is what it does now.
This version of the algorithm actually got 45% (9 out of 20 samples in the test) right. That does not sound like a lot, but in real life such an algorithm would need a lot of training.
I tend to think that the questions were very “overfitted”, since they seemed to be very “known” phrases and sentences. This version of the app doesn't work with such word groups (for instance “offense” - “defense” or “football” - “player”). Handling them would be one way to improve the prediction.
I have been thinking about more ways to improve the prediction. Of course you could provide more n-grams, but you could also try to work with word clusters and make the algorithm actually “learn” by updating it with the words and phrases a user inputs.
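As a rough idea of what “learning” from user input might look like, the Python sketch below simply adds the n-grams of newly typed text to the existing counts. The app does not do this today; the `learn` function and the made-up counts are assumptions for illustration.

```python
from collections import Counter, defaultdict

def learn(model, text, max_n=4):
    """Add the n-grams of newly typed text to the existing counts."""
    words = text.lower().split()
    for n in range(2, max_n + 1):
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            model[context][words[i + n - 1]] += 1

model = defaultdict(Counter)
model[("good",)]["morning"] = 2  # made-up counts standing in for the sample

for _ in range(3):
    learn(model, "good grief")  # the user keeps typing this phrase

best = model[("good",)].most_common(1)[0][0]
```

After the user has typed the phrase three times, its count overtakes the count from the sample, so the suggestion after “good” changes.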