17 June 2018

Summary

This project is about word prediction: predicting the next word in a sentence given the previous words. It uses machine learning techniques to train an algorithm on a specific natural language (in our case, English). A web version can be tested at https://rosengurtt.shinyapps.io/swiftkeywordprediction3/

In our tests, the right word is predicted approximately 17% of the time. The other 83% of the time, the predicted word is still a good candidate, in the sense that it makes sense and the resulting phrase is perfectly valid. When the user has partially typed a word, the tool uses the letters typed so far to make a better prediction. No word predictor can reach 100% accuracy or even come close to it. Otherwise, there would be no point in reading anything, because there would be nothing new to say. So we consider that our 17% hit rate is not as bad as it sounds.

How the algorithm works

The traditional way to implement word prediction is to analyze large sets of sample text and calculate the probability of word X following the sequence of words Y…Z. Once we know those probabilities, we predict the next word as the one with the highest probability of coming next, given the previous words in the provided sentence.
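The counting approach described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the code used by the tool: the function names `train_ngrams` and `predict_next`, the toy corpus, and the whitespace tokenization are all hypothetical, and `n` must be at least 2.

```python
from collections import Counter, defaultdict


def train_ngrams(sentences, n=3):
    """Count how often each word follows each context of n-1 words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            counts[context][words[i + n - 1]] += 1
    return counts


def predict_next(counts, text, n=3):
    """Return the most frequent follower of the last n-1 words, or None."""
    context = tuple(text.lower().split()[-(n - 1):])
    followers = counts.get(context)
    if not followers:
        return None  # this context was never seen in the sample text
    return followers.most_common(1)[0][0]


# Toy corpus, invented for this example.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat sat on the mat again",
]
model = train_ngrams(corpus, n=3)
print(predict_next(model, "a dog sat on the"))  # context is ("on", "the")
```

Dividing each count by the total for its context would give the probabilities themselves. For picking the most likely word, comparing the raw counts is enough.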

Our algorithm basically does that, but with a twist. The problem with the approach just described is that calculating the frequencies for word sequences longer than 5 or 6 words becomes impractical, because the number of possible combinations grows exponentially with the number of words. There is a trade-off between the accuracy of the prediction and the size of the data the application needs to work. The more data it has, the better the prediction, but the longer it takes to predict and the more computing power it needs.
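To give a feeling for that growth, the short sketch below counts the possible word sequences of each length. The vocabulary size of 10,000 words is an assumption chosen only for illustration, and `possible_sequences` is a hypothetical helper. Most of these sequences never occur in real text, but the number that does occur still grows quickly with the sequence length.

```python
def possible_sequences(vocabulary_size, length):
    """Number of distinct word sequences of the given length."""
    return vocabulary_size ** length


VOCABULARY_SIZE = 10_000  # assumed value, for illustration only

for length in range(1, 7):
    total = possible_sequences(VOCABULARY_SIZE, length)
    print(f"sequences of {length} words: {total:.0e}")
```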

Using only the last few words in the sentence to predict the next one is suboptimal because it ignores the rest of the text provided, which could be relevant. To tackle this problem, we added a feature to our algorithm: it tries to detect the subject matter of the input text, and then applies a set of probabilities tailored for that subject.

The assumption is that the words you are likely to use depend on the subject you are talking about. If the algorithm can detect the subject matter and has appropriate statistics for each subject, we expect it to make better predictions.

To decide what the subject matter is, it looks for specific words that are more frequent in one subject than in the others. For example, we found that "political", "government", "public", "war" and "state" occur more frequently when talking about politics than when talking about other subjects.
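A minimal sketch of this kind of keyword-based subject detection is shown below. It is not the tool's actual implementation: only the politics keywords come from the text above, while the keyword lists for the other three subjects, the name `detect_subject`, and the simple hit-counting rule are assumptions made for the example.

```python
from collections import Counter

# Only the "politics" list comes from the text; the others are invented.
SUBJECT_KEYWORDS = {
    "politics": {"political", "government", "public", "war", "state"},
    "computing": {"software", "computer", "data", "program", "network"},
    "math": {"theorem", "equation", "proof", "integer", "function"},
    "health": {"disease", "patient", "treatment", "medical", "symptoms"},
}


def detect_subject(text):
    """Return the subject whose keywords appear most often, or None."""
    words = [w.strip('.,;:!?"()') for w in text.lower().split()]
    hits = Counter()
    for subject, keywords in SUBJECT_KEYWORDS.items():
        hits[subject] = sum(1 for w in words if w in keywords)
    subject, count = hits.most_common(1)[0]
    return subject if count > 0 else None


print(detect_subject("The government declared war on the neighbouring state."))
print(detect_subject("The patient responded well to the treatment."))
```

Once a subject is detected, the predictor can look up the next word in the frequency table built from that subject's sample text instead of a general one.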

To train our algorithm, we downloaded Wikipedia pages because they are freely licensed and subject specific. We used the "Export Pages" functionality provided by Wikipedia: https://en.wikipedia.org/wiki/Special%3aExport. We then removed the metadata using a free tool called "WikiExtractor", available at https://github.com/attardi/wikiextractor. We selected the following 4 subjects: computing, math, politics and health.

Our online version of the tool uses only the last 2 words to predict the next one, to keep RAM usage below 1 GB, which is the maximum a free account can use. In our tests, we used up to 4 previous words to predict the next one. The source code of the tool and the tests carried out to evaluate it are available at https://github.com/rosengurtt/capstoneDataScience

Using the tool

Type or paste text in the box provided. The predicted next word and the subject detected will be shown below. The number of previous words used for the prediction is also shown.