Coursera Data Science Specialization - Capstone Project

Vladimir Djurovic

The Capstone project is the last part of the Coursera Data Science Specialization. Its goal is to apply the concepts, tools and techniques learned in the previous courses to a real-world problem.

Introduction and Objectives

The Capstone project is done in collaboration with SwiftKey. The goal is to create an application that predicts the next word based on the user's previous input.

For this project, SwiftKey provides data in the form of text collected from various internet sources, such as blogs, news sites and Twitter. The data is available in several languages, but only the English data is used here.

For text processing and for building the predictive model, several widely used R packages are employed (tm, RWeka, stringi, etc.).

The raw text data is available for download from the Coursera web site.

Applied Methods

The prediction model is based on an n-gram language model. The original text was cleaned of numbers, punctuation and profanity, converted to lowercase, and then split into sequences of n consecutive words, called n-grams.
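The cleaning and tokenization steps described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the project's R code (which relies on tm, RWeka and stringi). The names clean_text, ngrams and the tiny PROFANITY set are hypothetical, and the regular expression assumes plain English text, so it also splits contractions such as "don't".

```python
import re

# Hypothetical placeholder list; a real project would load a full profanity lexicon.
PROFANITY = {"darn", "heck"}

def clean_text(text):
    """Lowercase the text, drop digits and punctuation, and remove listed profanity."""
    text = text.lower()
    # Replace every character that is not a-z or whitespace with a space.
    # Assumption: English-only input; apostrophes are treated as separators.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [word for word in text.split() if word not in PROFANITY]

def ngrams(tokens, n):
    """Return all sequences of n consecutive words as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = clean_text("The 2 quick brown foxes, darn it, jumped!")
print(tokens)
print(ngrams(tokens, 2))
```

Calling ngrams with n = 1, 2 and 3 on the same token list yields the three token sets that an n-gram model of this kind would count.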

The prediction model uses 1-gram, 2-gram and 3-gram tokens, so predictions are based on expressions of up to 3 words.

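For improved efficiency, a Markov chain is used to predict the next word of the user's input, so each prediction depends only on the most recent words rather than on the entire input.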
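One way to picture a Markov chain predictor over 1-, 2- and 3-grams is a table that maps the last zero, one or two words to counts of the words that followed them. The Python sketch below is an illustration under stated assumptions, not the project's actual implementation: build_model and predict_next are hypothetical names, and the fallback from longer to shorter contexts is an assumed strategy that the project description does not spell out.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_order=2):
    """Map each context of 0..max_order preceding words to counts of the next word."""
    model = defaultdict(Counter)
    for order in range(max_order + 1):
        for i in range(order, len(tokens)):
            context = tuple(tokens[i - order:i])  # empty tuple when order == 0
            model[context][tokens[i]] += 1
    return model

def predict_next(model, words, k=3, max_order=2):
    """Return up to k candidate next words, trying the longest context first."""
    suggestions = []
    for order in range(min(max_order, len(words)), -1, -1):
        context = tuple(words[len(words) - order:])
        # Assumed fallback: if a context is unseen or yields too few words,
        # continue with a shorter context.
        for word, _count in model.get(context, Counter()).most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    return suggestions

corpus = "i like green tea i like green apples i like black tea".split()
model = build_model(corpus)
print(predict_next(model, ["i", "like"]))
```

A context of two words corresponds to the 3-gram counts, one word to the 2-gram counts, and the empty context to plain word frequencies. Because only the last two words are ever looked up, each prediction is a handful of dictionary accesses regardless of how long the input is.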

Application Usage

The word prediction application has a very simple user interface: the user enters some text, and the application displays up to 3 words that could come next.

The user interface is reactive, which means that suggestions are displayed as the user types into the text box.

Additional Info