SmartKeyBoard App for the Capstone Project

Mykyta Zharov
12.03.2020

Overview

This presentation describes the next-word prediction app built for the final Capstone project of the Coursera JHU Data Science Specialization. The following topics are briefly covered:

Data preprocessing and cleaning
Model building
App functionality

App: https://mykytazharov.shinyapps.io/SmartKeyBoardApp/
Milestone report: https://rpubs.com/kitazharov/573608

Data preprocessing and cleaning

Three text datasets were provided, containing tweets (167 MB), blogs (210 MB) and news posts (205 MB). All datasets were combined, and 1% of the data was randomly sampled to build the prediction model. The following initial steps were performed:

The dataset was split into train (75%) and test (25%) datasets.
The text corpus was built on 1% of the train dataset.
All texts were lowercased. Numbers, emails, URLs and profane words were deleted.
Punctuation and apostrophes were removed.
Unnecessary whitespace was deleted.
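
The steps above can be sketched in a few lines of Python. This is only an illustration, not the code behind the app: the function names, the regular expressions, the random seed and the tiny placeholder profanity list are all assumptions, since the presentation does not say which tools or word list were used.

```python
import random
import re

# Placeholder list (assumption): the actual profanity list is not given in the source.
PROFANITY = {"darn", "heck"}

EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
NUMBER_RE = re.compile(r"\d+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def split_train_test(lines, train_fraction=0.75, seed=42):
    """Shuffle a copy of the lines and split them into train and test parts."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def clean_text(text, profanity=PROFANITY):
    """Apply the cleaning steps from the list above to one piece of text."""
    text = text.lower()
    text = EMAIL_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    # Drop apostrophes first so that "don't" becomes "dont", not "don t".
    text = text.replace("'", "").replace("\u2019", "")
    # Replace remaining punctuation and other non-letters with spaces.
    text = NON_LETTER_RE.sub(" ", text)
    # split() also collapses repeated whitespace.
    words = [w for w in text.split() if w not in profanity]
    return " ".join(words)
```

For example, clean_text("Don't wait 24 hours.") returns "dont wait hours".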

Model building

The prediction model used in the application is a 4-gram language model with the stupid backoff algorithm. More information can be found via the following links:

n-gram language model - https://en.wikipedia.org/wiki/N-gram
stupid backoff algorithm - https://www.aclweb.org/anthology/D07-1090.pdf
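
To make the idea concrete, here is a minimal Python sketch of next-word prediction with stupid backoff. It is not the app's implementation: the function names and the in-memory data structures are assumptions, and the backoff factor of 0.4 is the value suggested in the linked paper, not necessarily the one used in the app. A word is scored by its relative frequency after the longest matching context, and each step down to a shorter context multiplies the score by the backoff factor.

```python
from collections import Counter, defaultdict


def train(sentences, max_n=4):
    """Count all n-grams up to max_n and record which words follow each context."""
    counts = Counter()
    followers = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                ngram = tuple(tokens[i:i + n])
                counts[ngram] += 1
                followers[ngram[:-1]].add(ngram[-1])
    return counts, followers


def predict_next(model, text, k=3, max_n=4, alpha=0.4):
    """Return the k best next words for the typed text using stupid backoff."""
    counts, followers = model
    tokens = text.lower().split()
    context = tuple(tokens[-(max_n - 1):]) if max_n > 1 else ()
    scores = {}
    weight = 1.0
    while True:
        if context:
            denom = counts[context]
        else:
            # Empty context: fall back to plain unigram frequencies.
            denom = sum(counts[(w,)] for w in followers[()])
        for word in followers.get(context, ()):
            # Keep the score from the longest context in which the word was seen.
            if word not in scores:
                scores[word] = weight * counts[context + (word,)] / denom
        if not context:
            break
        context = context[1:]  # back off to a shorter context
        weight *= alpha        # and penalise the score
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]
```

With a model trained on the four sentences "i love you", "i love you", "i love cats" and "you love cats", predict_next(model, "i love") ranks "you" first, then "cats", then "love" (the last one reached only by backing off to unigrams).
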

After the model was built, it was tested on the 4-grams from the test dataset. The resulting accuracy of the algorithm was around 20%.
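
The evaluation can be sketched as follows. The presentation does not say whether a prediction counted as correct only when the top suggestion matched or when any of the three suggestions matched, so the cut-off k is left as a parameter. The function names are hypothetical, and predict stands for any function that takes the typed text and returns a ranked list of candidate words.

```python
def four_grams(sentences):
    """Yield every 4-gram (as a tuple of words) from the given sentences."""
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - 3):
            yield tuple(tokens[i:i + 4])


def accuracy(predict, test_sentences, k=1):
    """Share of test 4-grams whose last word is among the top-k predictions."""
    hits = 0
    total = 0
    for *context, target in four_grams(test_sentences):
        total += 1
        # The first three words are the input, the fourth is the expected word.
        if target in predict(" ".join(context))[:k]:
            hits += 1
    return hits / total if total else 0.0
```
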

App functionality

The Shiny application consists of two pages:

Info

On the Info page, the user can find general information about the model and the purpose of the app, as well as instructions on how to use it.

App

The App page itself has a text input field where the user can type text. The 3 best predictions are shown as buttons under the text input. Clicking a prediction button appends that word to the end of the typed text.
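
The button behaviour amounts to a small text update, sketched here in Python rather than in the app's own Shiny code. The function name is hypothetical, and the handling of a trailing space is an assumption about how the app avoids double spaces.

```python
def apply_prediction(typed_text, prediction):
    """Return the typed text with the clicked prediction added at the end."""
    if not typed_text or typed_text.endswith(" "):
        return typed_text + prediction
    return typed_text + " " + prediction
```
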