Data Science Capstone Project - Next word prediction

12/6/2017

About the project

This project presents the result of a challenge in the field of Natural Language Processing (NLP), which is a branch of data science.

We built an application that predicts the next word, which can help with text messaging and document writing on mobile phones.

Data Preparation

SwiftKey data files

Because of performance constraints, we used only a 1% sample of the SwiftKey data (tweets, blogs and news). Since the app is meant to predict the next word in mobile text messages, we concluded that it is better to use the tweets data alone and not mix in the blogs and news data. This makes the predictions more meaningful for short messages.
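
The 1% sampling step can be sketched as follows. This is a minimal illustration in Python, not the code used in the project; the function name sample_lines, the fixed seed, and the choice to keep each line independently with probability 0.01 are all assumptions made for the example.

```python
import random

def sample_lines(lines, fraction=0.01, seed=42):
    """Keep each line independently with probability `fraction`.

    Hypothetical helper: the source only states that 1% of the data was used,
    not how the sample was drawn.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]
```

Sampling line by line keeps memory use low, since the full file never has to be held as a corpus before it is reduced.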

Processing and cleaning

We built a corpus from the data and applied a few filters, such as removing numbers, removing punctuation and stripping whitespace. Finally, we built the unigrams, bigrams and trigrams of the text using a term-document matrix. Because of performance constraints we did not go further and did not build 4-grams and 5-grams.
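
The cleaning and n-gram counting described above can be sketched like this. It is an illustrative Python version under stated assumptions: the helper names clean and ngram_counts are hypothetical, only the three filters named in the text are applied, and a plain counter stands in for the term-document matrix.

```python
import re
import string
from collections import Counter

def clean(text):
    """Apply the three filters named in the text."""
    text = re.sub(r"\d+", "", text)  # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove ASCII punctuation
    return " ".join(text.split())  # strip and collapse whitespace

def ngram_counts(lines, n):
    """Count every n-gram (as a tuple of words) across the cleaned lines."""
    counts = Counter()
    for line in lines:
        tokens = clean(line).split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

tweets = ["i love you", "i love you", "i love cake", "we love tea"]
unigrams = ngram_counts(tweets, 1)
bigrams = ngram_counts(tweets, 2)
trigrams = ngram_counts(tweets, 3)
print(bigrams[("i", "love")])  # 3
```

Lines shorter than n words simply contribute no n-grams, so no special handling is needed for very short tweets.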

Stupid Backoff algorithm

We reviewed most of the common algorithms in NLP and found that Stupid Backoff is one of the best fits for this task.

With this algorithm (keep in mind that we have only trigrams, bigrams and unigrams) we use just the last two words of the input phrase. The algorithm searches the trigrams, bigrams and unigrams, in that order, to find the words with the highest score.

At each level, the score is the count of the matched phrase divided by the count of the matched words at the level below. For example, at the trigram level we divide the trigram count by the count of its first two words in the bigrams. Each time the algorithm backs off to the next level down, the score is multiplied by a coefficient of 0.4.
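
The scoring described above can be sketched as follows. This is a minimal Python illustration, not the application's actual code: the function name predict, the top_k parameter, the dictionary layout (tuples of words mapped to counts) and the tie-breaking by alphabetical order are assumptions made for the example. The unigram level divides by the total word count, which the text does not spell out but which is the usual base case for Stupid Backoff.

```python
ALPHA = 0.4  # backoff coefficient named in the text

def predict(phrase, unigrams, bigrams, trigrams, top_k=3):
    """Return up to top_k (word, score) pairs, highest score first."""
    words = phrase.split()
    scores = {}

    # Trigram level: count(w1 w2 w) / count(w1 w2)
    if len(words) >= 2:
        w1, w2 = words[-2], words[-1]
        base = bigrams.get((w1, w2), 0)
        if base:
            for (a, b, c), n in trigrams.items():
                if (a, b) == (w1, w2):
                    scores[c] = n / base

    # Bigram level: 0.4 * count(w2 w) / count(w2), for words not yet scored
    if words:
        w2 = words[-1]
        base = unigrams.get((w2,), 0)
        if base:
            for (a, b), n in bigrams.items():
                if a == w2 and b not in scores:
                    scores[b] = ALPHA * n / base

    # Unigram level: 0.4 * 0.4 * count(w) / total words, for words not yet scored
    total = sum(unigrams.values())
    for (w,), n in unigrams.items():
        if w not in scores:
            scores[w] = ALPHA * ALPHA * n / total

    # Highest score first; ties are broken alphabetically for determinism
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

# Toy counts from the lines "i love you" (twice), "i love cake", "we love tea"
unigrams = {("i",): 3, ("love",): 4, ("you",): 2, ("cake",): 1, ("we",): 1, ("tea",): 1}
bigrams = {("i", "love"): 3, ("love", "you"): 2, ("love", "cake"): 1,
           ("we", "love"): 1, ("love", "tea"): 1}
trigrams = {("i", "love", "you"): 2, ("i", "love", "cake"): 1, ("we", "love", "tea"): 1}

print([w for w, _ in predict("i love", unigrams, bigrams, trigrams)])  # ['you', 'cake', 'tea']
```

In the toy example, "you" and "cake" are scored at the trigram level (2/3 and 1/3), while "tea" is only found at the bigram level and is therefore discounted to 0.4 * 1/4 = 0.1. The sketch scans every n-gram on each call for clarity; an index keyed by prefix would be the obvious way to speed it up.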

Word Prediction application

Within the application, we first load the unigram, bigram and trigram data on startup. This is why the application takes a few seconds to launch.

On the left-hand side there is a text input for the user. After entering the input phrase, the user presses the "Predict next word" button, and the application applies the Stupid Backoff algorithm to predict the next word.

Because of performance limitations, it takes a few seconds to predict the next word. The predicted words are listed in order of their scores, with the highest score at the top.