Predicting the Next Word

KismetK

This is the final project for the Data Science Specialization Capstone course, offered by Johns Hopkins University in partnership with SwiftKey.


Background
Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone and a leading software company, has built a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models.

Introduction

Overview
The goal of this exercise is to create a product that highlights the prediction algorithm built for the capstone and provides an interface that others can access. The product is a Shiny app that takes a phrase (multiple words) in a text box as input and outputs a prediction of the next word.

The Final Product
The next word prediction app is hosted on shinyapps.io:
https://kishi.shinyapps.io/predict-next-word/

Working Method

  • Getting and cleaning data
  • Build basic n-gram model
  • Review the dataset
  • Word Prediction Model
  • Build the Shiny interface

Data Preparation

The dataset is provided by SwiftKey. We use the English-language data. To speed up pre-processing, we built the models on random samples of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation of the results that would be obtained using all the data.
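
The sampling idea can be illustrated with a minimal sketch. It is written in Python for illustration only (the project itself uses R); the function name `sample_lines`, the 5% fraction, and the seed are assumptions, not values taken from the project.

```python
import random

def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`.

    A fixed seed makes the sample reproducible across runs.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

# Stand-in corpus; in practice these would be lines read from the text files.
corpus = [f"line {i}" for i in range(10_000)]
sample = sample_lines(corpus, fraction=0.05)
print(len(sample), "of", len(corpus), "lines kept")
```

Sampling line by line like this avoids holding a cleaned copy of the full corpus in memory; the fraction trades accuracy of the frequency estimates against processing time.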

We use the R package tm (Text Mining) to clean up the data (tokenization and profanity filtering).

Tokenization - identifying appropriate tokens such as words, punctuation, and numbers.

  • converted to plain text
  • converted to lowercase

Profanity filtering - removing profanity and other words you do not want to predict.

  • removed numbers, punctuation, and extra whitespace
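
The cleaning steps listed above can be sketched as follows. This is an illustrative Python equivalent, not the tm-based R code used in the project; `clean_text` and the tiny `BANNED_WORDS` set are hypothetical stand-ins (a real filter would load a full profanity word list).

```python
import re
import string

# Hypothetical placeholder list, used only to demonstrate the filtering step.
BANNED_WORDS = {"darn", "heck"}

def clean_text(text, banned=BANNED_WORDS):
    """Lowercase, strip numbers and punctuation, collapse whitespace, filter words."""
    text = text.lower()
    # Replace digit runs with a space so neighbouring words do not merge.
    text = re.sub(r"\d+", " ", text)
    # Delete ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments also collapses repeated whitespace.
    tokens = text.split()
    return [t for t in tokens if t not in banned]

print(clean_text("Hello, World! 2 darn   cats."))
```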

Quadgram, trigram, and bigram n-grams are then created. The resulting objects are saved as compressed R files.
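
As a rough illustration of how the n-gram objects might be built, the sketch below counts bigrams, trigrams, and quadgrams over tokenized sentences. It is Python rather than the project's R, and the helper names `ngrams` and `count_ngrams` and the toy sentences are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-word tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(sentences, n):
    """Count n-gram frequencies across a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(ngrams(tokens, n))
    return counts

# Toy data standing in for the cleaned sample.
sentences = [
    ["i", "love", "new", "york"],
    ["i", "love", "new", "shoes"],
]
bigrams = count_ngrams(sentences, 2)
trigrams = count_ngrams(sentences, 3)
quadgrams = count_ngrams(sentences, 4)
print(bigrams.most_common(2))
```

Counting within each sentence separately keeps n-grams from spanning sentence boundaries.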

Word Prediction Model

Build basic n-gram model

Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Algorithm used to make the prediction

  • The quadgram, trigram, and bigram n-grams are aggregated into three frequency tables.
  • These tables are loaded and used to predict the next word from the text entered by the user.
  • If the input box contains one word, the bigram data frame is searched.
  • If the input box contains two words, the trigram data frame is searched.
  • If the input box contains three words, the quadgram data frame is searched.
  • Predictions appear automatically under the text box; users can click their desired word in the clickable boxes to add it to the input box.
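
The lookup described in the list above can be sketched as below. This is a simplified Python illustration, not the app's R code: `build_table`, `predict_next`, and the toy counts are hypothetical, and using only the last three words for longer inputs is an assumption, since the list does not say how such inputs are handled. No fallback to a shorter n-gram is included, because none is described.

```python
from collections import Counter

def build_table(ngram_counts):
    """Map each (n-1)-word prefix to a Counter of the words that follow it."""
    table = {}
    for gram, freq in ngram_counts.items():
        prefix, next_word = gram[:-1], gram[-1]
        table.setdefault(prefix, Counter())[next_word] += freq
    return table

def predict_next(text, tables, top_k=3):
    """Suggest up to `top_k` next words, most frequent first."""
    words = text.lower().split()
    if not words:
        return []
    # Assumption: for inputs longer than three words, keep the last three.
    context = tuple(words[-3:])
    # One word -> bigram table (n=2), two -> trigram (3), three -> quadgram (4).
    table = tables.get(len(context) + 1, {})
    candidates = table.get(context, Counter())
    return [word for word, _ in candidates.most_common(top_k)]

# Toy frequency tables keyed by n-gram size.
tables = {
    2: build_table({("new", "york"): 5, ("new", "shoes"): 2}),
    3: build_table({("love", "new", "york"): 3}),
    4: build_table({("i", "love", "new", "york"): 1}),
}
print(predict_next("new", tables))
print(predict_next("I love new", tables))
```

Storing each table as a prefix-to-counter mapping makes the lookup a single dictionary access per keystroke, which suits an interactive interface.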

How the app works (The Usage Of The Application)

Step 1 Type word(s) in the right box

Step 2 Suggested words may appear under the box; select one of them or type your own word

Step 3 Keep typing, or keep selecting words from the predictions, to build your sentence.
