Capstone project - Data Science Specialization
Coursera - Johns Hopkins University
While a user is typing, the SwiftKey Keyboard predicts the next word based on a Text Prediction model.
In this Capstone project, a Text Prediction model was developed, along with an application that demonstrates how it works.
The HC Corpus is a large data set with 2.5 million records of Text data from News, Blogs and Twitter.
The Corpus was preprocessed: it was cleaned of misspellings, profanity, punctuation and numbers, and its casing was normalized.
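The cleaning step could look roughly like the sketch below. It is only an illustration in Python, not the code used in the project: the function name clean_line and the tiny PROFANITY set are hypothetical placeholders, lower-casing is an assumed way of handling casing, and misspelling handling is left out because it depends on a dictionary the project description does not name.

```python
import re

# Hypothetical placeholder list; the project does not say which
# profanity list was actually used.
PROFANITY = {"darn", "heck"}

def clean_line(line, banned=PROFANITY):
    """Lowercase a line, strip numbers and punctuation, drop banned words."""
    text = line.lower()                       # normalize casing (assumed: lower case)
    text = re.sub(r"\d+", " ", text)          # remove numbers
    text = re.sub(r"[^a-z'\s]", " ", text)    # remove punctuation, keep apostrophes
    tokens = [t.strip("'") for t in text.split()]
    return [t for t in tokens if t and t not in banned]

tokens = clean_line("Heck, I paid 20 dollars for THIS!")
```

Each line of the Corpus would be passed through such a function before any n-grams are counted, so that "This", "this" and "this!" all count as the same word.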
Based on the clean data set, a Text Prediction model was created, generating n-grams for predicting the most likely next words.
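As a rough illustration of how n-grams can drive next-word prediction, the Python sketch below counts which word follows each context of up to two words and, when predicting, falls back to a shorter context if the longer one was never seen. The names build_model and predict, the limit of 3-grams, and the simple back-off rule are all assumptions made for this example; the project description does not state which n-gram sizes or smoothing method were actually used.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=3):
    """Map each context tuple (0 to max_n-1 words) to a Counter of next words."""
    model = defaultdict(Counter)
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                model[tuple(context)][word] += 1
    return model

def predict(model, tokens, k=3, max_n=3):
    """Return up to k likely next words, backing off to shorter contexts."""
    for size in range(max_n - 1, -1, -1):
        if size > len(tokens):
            continue
        context = tuple(tokens[len(tokens) - size:])   # last `size` words
        if context in model:
            # Most frequent continuations of the longest context that was seen
            return [w for w, _ in model[context].most_common(k)]
    return []

# Toy training data, already cleaned and tokenized (made up for this example)
sentences = [
    ["i", "love", "you"],
    ["i", "love", "data"],
    ["i", "love", "you"],
    ["we", "love", "you"],
]
model = build_model(sentences)
suggestions = predict(model, ["i", "love"])
```

With k=3 the function returns at most three candidates, matching the three suggestions a smart keyboard typically shows. When no part of the typed text has been seen before, it falls back to the most frequent single words.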
The Text Predictor application demonstrates how a smart keyboard can predict the most likely next words.
Type some text and click 'Predict'. The application will display the 3 words with the highest probability of being typed next.
Click here to go to the application: Text Predictor
The prediction model for this application can be improved in several ways.
A substantial improvement is expected from increasing the training set to a much larger text Corpus. The text Corpus currently used is limited; the more text the prediction model is trained on, the more accurate the results will become.