Data Science Project - Word Prediction App

Adrian R. Angkawijaya

June 2018

About the Project

This project is the final Capstone Project of the Johns Hopkins Data Science Specialization, hosted by Coursera in collaboration with SwiftKey. The application simulates SwiftKey's text input app, of the kind commonly seen in smartphone text messaging or on web search sites, where the next word is predicted as the user types.

All code and algorithms for the project are written in R. Natural Language Processing (NLP) techniques are used for the text mining and the prediction. RWeka, tm and stringi are some of the R packages for Natural Language Processing used in the project. Check out this Wikipedia link to learn more about NLP.

Approach and the Algorithm

The following cleaning steps are applied to the data before the model is created:

The model was then built as an N-gram model. N-gram tokens of five sizes were created (unigram, bigram, trigram, fourgram, fivegram) and transformed into frequency data frames. The model can then predict the next word based on the corresponding n-gram frequencies. More technical background on the technique is available on the Wikipedia page here.
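To make the idea of n-gram frequency data frames concrete, here is a minimal sketch in base R. It is not the project's actual code (which relied on packages such as RWeka and tm). The toy corpus and the helper names make_ngrams and ngram_freq are assumptions made up for illustration, and only the first three n-gram sizes are built.

```r
# Toy corpus (made up); the real project used a much larger text data set.
corpus <- c("the cat sat on the mat",
            "the cat ate the fish",
            "the dog sat on the rug")

# Hypothetical helper: split one line of text into all n-grams of size n.
make_ngrams <- function(line, n) {
  words <- strsplit(tolower(line), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: count n-grams across all lines and return a
# frequency data frame, sorted from most to least frequent.
ngram_freq <- function(texts, n) {
  grams <- unlist(lapply(texts, make_ngrams, n = n))
  if (length(grams) == 0) {
    return(data.frame(ngram = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  counts <- table(grams)
  df <- data.frame(ngram = names(counts),
                   freq = as.integer(counts),
                   stringsAsFactors = FALSE)
  df[order(-df$freq, df$ngram), ]
}

unigrams <- ngram_freq(corpus, 1)
bigrams  <- ngram_freq(corpus, 2)
trigrams <- ngram_freq(corpus, 3)

print(head(bigrams, 3))
```

The same function would produce the fourgram and fivegram tables by passing n = 4 and n = 5, given lines long enough to contain them.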

The Application

The application itself is very simple to use. The user types any number of words in the first box, and the second box then predicts and shows the next word automatically every time a new word is entered. An example of how it works can be seen below:
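In code, the lookup behind the second box could work roughly like the sketch below. It assumes a simple back-off strategy (try the longest matching context first, then shorter ones), which is a common choice for n-gram predictors but is not confirmed as the app's exact method. The function name predict_next and the frequency values are hypothetical, and only bigram and trigram tables are used to keep the example short.

```r
# Hypothetical frequency tables: one row per n-gram with its count.
# The counts are made up for illustration.
bigrams <- data.frame(
  ngram = c("of the", "of a", "the end", "the world"),
  freq  = c(50L, 20L, 8L, 12L),
  stringsAsFactors = FALSE
)
trigrams <- data.frame(
  ngram = c("end of the", "end of an", "of the world", "of the day"),
  freq  = c(9L, 2L, 7L, 5L),
  stringsAsFactors = FALSE
)
tables <- list(`2` = bigrams, `3` = trigrams)

# Predict the next word for input of any length (assumed back-off scheme).
predict_next <- function(input, tables, max_n = 3) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(NA_character_)

  # Start with the largest n-gram the input can support, then back off.
  for (n in seq(min(max_n, length(words) + 1), 2, by = -1)) {
    tbl <- tables[[as.character(n)]]
    if (is.null(tbl)) next
    # The context is the last n - 1 words typed by the user.
    prefix <- paste0(paste(tail(words, n - 1), collapse = " "), " ")
    hits <- tbl[startsWith(tbl$ngram, prefix), ]
    if (nrow(hits) > 0) {
      best <- hits$ngram[which.max(hits$freq)]
      return(sub(".* ", "", best))  # keep only the last word
    }
  }
  NA_character_  # no matching n-gram found
}

print(predict_next("this is the end of the", tables))
print(predict_next("a piece of", tables))
```

With these made-up tables, the first call matches the trigram context "of the", while the second finds no trigram for "piece of" and falls back to the bigram context "of".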

Additional Information

Enjoy and have fun with the App!