Word Predictor

Prasun Jha
9th Dec 2018

The Word Predictor app predicts the next word from a phrase entered by the user, much like the keyboard on a smartphone.
The user just has to input a phrase and then press the 'Predict' button on the app for the algo to make its prediction.
The predictions are based on an n-gram model, which breaks any given sentence into overlapping sequences of n consecutive words.

Any machine learning algorithm depends on the data that it was built on. This app is based on data obtained from Twitter, news and blogs.
1% of the data from the 3 files (news, Twitter, blogs) was sampled at random for further processing. A larger percentage of the data could not be used due to hardware limitations.
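A minimal sketch of random sampling at the 1% level, written in Python for illustration. The function name `sample_lines`, the seed and the toy `corpus` are assumptions made for this example and are not taken from the app.

```python
import random

def sample_lines(lines, fraction=0.01, seed=2018):
    """Keep each line independently with probability `fraction`.

    A fixed seed (an arbitrary choice here) makes the sample reproducible.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

# Hypothetical stand-ins for the three source files (news, twitter, blogs).
corpus = {
    "news": [f"news line {i}" for i in range(1000)],
    "twitter": [f"tweet {i}" for i in range(1000)],
    "blogs": [f"blog line {i}" for i in range(1000)],
}

sampled = {name: sample_lines(lines) for name, lines in corpus.items()}
for name, lines in sampled.items():
    print(name, len(lines))
```

Sampling line by line in this way means the whole file never has to be held in memory at once if the lines are streamed from disk, which matters when hardware is the limiting factor.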
A 3-gram model was built from the extracted data. For example, 'I love you loads' would be broken down into 'I love you' and 'love you loads'.
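The split described above can be sketched in a few lines of Python. The helper name `ngrams` and the whitespace-based tokenisation are assumptions for illustration; the app's own tokenisation may differ.

```python
def ngrams(sentence, n=3):
    """Return every run of n consecutive words in the sentence."""
    words = sentence.split()  # simple whitespace tokenisation (an assumption)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("I love you loads"))
# → ['I love you', 'love you loads']
```

A sentence with fewer than n words yields an empty list, so short lines simply contribute nothing to the model.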
Based on this 3-gram model, a lookup table was constructed that holds, for each pair of words, the most frequently occurring 3-gram beginning with that pair.
This table was then used to predict the next word from an input phrase supplied by the user.
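The lookup and prediction steps could look roughly like the Python sketch below. The names `build_table` and `predict` are hypothetical, and two details are assumptions not stated above: that the last two words of the input are used as the lookup key, and that no prediction is returned when the pair was never seen or the input is shorter than two words.

```python
from collections import Counter, defaultdict

def build_table(sentences):
    """Map each pair of words to the word that most often follows it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Slide a window of three words over the sentence.
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            counts[(w1, w2)][w3] += 1
    return {pair: followers.most_common(1)[0][0]
            for pair, followers in counts.items()}

def predict(table, phrase):
    """Predict the next word from the last two words of the phrase."""
    words = phrase.lower().split()
    if len(words) < 2:
        return None  # assumption: no prediction for very short input
    return table.get(tuple(words[-2:]))  # None if the pair was never seen

# Toy training data, invented for this example.
toy = ["I love you loads", "I love you too", "we love you loads"]
table = build_table(toy)

print(predict(table, "They say I love"))    # → you
print(predict(table, "We really love you")) # → loads
```

Keeping only the single most frequent follower per pair makes the table small and the lookup a constant-time dictionary access, at the cost of offering just one suggestion.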

The algo is case insensitive, i.e. the prediction does not depend on whether the input is in upper or lower case.
The app has been built to avoid predicting profanities.
The algorithm seems to work well with common phrases, even though a large sample could not be used for the predictions.
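One way the case handling and profanity filtering could be done is sketched below in Python. This is an assumption about the approach, not a description of the app: here the text is lower-cased before lookup, and any training sentence containing a blocklisted word is dropped so that such words can never be predicted. `BLOCKLIST`, `normalise` and `is_clean` are hypothetical names, and the blocklist entries are mild placeholders.

```python
BLOCKLIST = {"darn", "heck"}  # placeholder entries; a real list would be loaded from a file

def normalise(text):
    """Lower-case the text and split it into words."""
    return text.lower().split()

def is_clean(text, blocklist=BLOCKLIST):
    """True if none of the words in the text appear in the blocklist."""
    return not any(word in blocklist for word in normalise(text))

# Toy training sentences, invented for this example.
training = ["What the HECK is this", "Have a nice day"]
clean = [sentence for sentence in training if is_clean(sentence)]

print(normalise("I LOVE you"))  # → ['i', 'love', 'you']
print(clean)                    # → ['Have a nice day']
```

Filtering the training data rather than the output means no extra check is needed at prediction time, although filtering the predictions instead would keep more of the sample.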

The app can be improved considerably by using a larger training set. This would require better hardware.
A 4-gram, 5-gram or larger n-gram model should also provide better predictions. Again, this requires better hardware for the app to work without frustrating the user.
With a larger n-gram model, the predictions can also be made sensitive to the context of the sentence.

Every machine learning algorithm works on the assumption that the future can be predicted based on the past. Similarly, the algorithm assumes that the Twitter, blogs and news lingo will not change too much in the future.
Since only the Twitter, blogs and news datasets were used for the prediction, there is an underlying assumption that these 3 files completely cover the dialect used in typed conversations.