Word Prediction App

Tarun Kaushik
April 23, 2015

Part of the Capstone of the Data Science Specialization by Johns Hopkins University on Coursera

Introduction

The objective of this project is to predict the word the user is about to type. The user types a few words, and a prediction is made from those words.

The three broad steps involved in the capstone are as follows:

  • Explore the data obtained from the course website and provided by HC Corpora.
  • Develop an efficient, fast, and less memory-intensive model for prediction.
  • Develop an application in which the model is deployed and can be used by the end user.

Algorithm

Getting and Cleaning data

  • Data was downloaded from the course website.
  • Each file was sampled by randomly selecting 5% of its lines.
  • The three sampled datasets (blogs, news, and Twitter) were merged into a single dataset.
  • Short forms were expanded, such as 're to are, n't to not, and 've to have.
  • All special characters, numbers, etc. were replaced with a space.
  • Multiple spaces were replaced with a single space.
  • All text was converted to lower case.
  • The text was split on spaces and stored as tokens.
  • Each word was checked for presence in the dictionary, and profane words were removed.
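
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code used in the App: the names sample_lines, clean_and_tokenize, CONTRACTIONS, and PROFANE are hypothetical, the contraction table covers only the three short forms listed above, and the profanity list is a placeholder. The sketch lower-cases the text first so that the contraction patterns match regardless of case.

```python
import random
import re

# Hypothetical contraction table: only the three short forms named in the list above.
CONTRACTIONS = [
    (r"n't\b", " not"),
    (r"'re\b", " are"),
    (r"'ve\b", " have"),
]

# Placeholder profanity list; the list actually used is not stated.
PROFANE = {"darn"}


def sample_lines(lines, fraction=0.05, seed=0):
    """Randomly select a fraction of the lines (5% by default)."""
    k = round(len(lines) * fraction)
    return random.Random(seed).sample(lines, k)


def clean_and_tokenize(line, dictionary=None, profane=PROFANE):
    """Lower-case, expand short forms, strip non-letters, split into tokens."""
    text = line.lower()
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    # Replace every run of special characters and digits with one space.
    text = re.sub(r"[^a-z]+", " ", text)
    tokens = text.split()
    # Drop profane words; if a dictionary is supplied, keep only known words.
    return [
        t for t in tokens
        if t not in profane and (dictionary is None or t in dictionary)
    ]
```

For example, clean_and_tokenize("We're sure they don't have 3 darn Cats!!") yields the tokens we, are, sure, they, do, not, have, cats.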

Prediction

After tokenization, a final dataset was created that contained:

  • 3-gram: a set of three preceding words
  • 2-gram: a set of two preceding words
  • 1-gram: one preceding word
  • Word frequency: frequency of words present in the sample
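
One possible way to build these four datasets from a token list is sketched below. The function name build_ngram_tables and the nested-dictionary layout are assumptions made for illustration; the storage format used in the App is not described here.

```python
from collections import Counter, defaultdict


def build_ngram_tables(tokens, max_context=3):
    """Count which word follows each context of 1, 2, and 3 preceding words.

    Returns (tables, word_freq), where tables[n] maps a tuple of n preceding
    words to a Counter of the words that followed it.
    """
    tables = {n: defaultdict(Counter) for n in range(1, max_context + 1)}
    for i, word in enumerate(tokens):
        for n in range(1, max_context + 1):
            if i >= n:  # enough preceding words for a context of length n
                context = tuple(tokens[i - n:i])
                tables[n][context][word] += 1
    # Plain word frequencies, used when no context matches.
    word_freq = Counter(tokens)
    return tables, word_freq
```

With the tokens of "i want to go i want to eat", the 2-word context (want, to) is followed once by go and once by eat.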

A function that takes a string as input looked up each of the four datasets above and arranged the predictions from each in decreasing order of probability, taking the datasets in the order listed.

From this list, a maximum of 10 predictions were displayed. The number displayed depends on the input by the user.
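
The lookup order described above might look like the sketch below. It is an illustration under stated assumptions, not the App's function: predict, TABLES, and WORD_FREQ are hypothetical names, the toy counts are made up, and skipping words already suggested by a longer context is an added assumption. Ranking by count within one context gives the same order as ranking by probability.

```python
def predict(text, tables, word_freq, k=10):
    """Collect predictions from the 3-gram, 2-gram, 1-gram, then frequency data."""
    k = min(k, 10)  # at most 10 predictions are shown
    tokens = text.lower().split()
    predictions = []
    for n in (3, 2, 1):
        if len(tokens) < n:
            continue
        candidates = tables.get(n, {}).get(tuple(tokens[-n:]), {})
        # Highest count first; ties broken alphabetically for determinism.
        for word in sorted(candidates, key=lambda w: (-candidates[w], w)):
            if word not in predictions:
                predictions.append(word)
    # Fall back to the most frequent words overall.
    for word in sorted(word_freq, key=lambda w: (-word_freq[w], w)):
        if word not in predictions:
            predictions.append(word)
    return predictions[:k]


# Toy data for illustration only.
TABLES = {
    3: {("i", "want", "to"): {"go": 2, "eat": 1}},
    2: {("want", "to"): {"go": 2, "be": 2, "eat": 1}},
    1: {("to",): {"the": 5, "go": 2}},
}
WORD_FREQ = {"the": 9, "to": 7, "and": 4}
```

With this toy data, predict("I want to", TABLES, WORD_FREQ, k=5) returns go, eat, be, the, to: the 3-gram matches come first, followed by new words from the shorter contexts and finally from the word frequencies.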

Performance of algorithm

To evaluate the performance of the App, the remaining data (95% of the total data obtained from the course website) was used. Two measures were checked:

  • Whether the actual following word matched the first prediction.
  • Whether the actual following word was among the top 10 predictions.

The results are as follows:

  • First prediction match: 14.71%
  • Top 10 predictions match: 57.83%

Another version of the model, with 70% of the whole data used for development and 30% used for validation, gave the following results:

  • First prediction match: 18.33%
  • Top 10 predictions match: 74.94%

However, the App uses only 5% of the data for development because of memory and run-time constraints.
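
The two match rates reported above could be computed along these lines. This is a sketch only: match_rates and the stub predictor are hypothetical, and it assumes every word position in the held-out text is used as a test case, which may differ from how the reported figures were obtained.

```python
def match_rates(test_sentences, predict_fn, k=10):
    """Return (first-prediction match rate, top-k match rate).

    test_sentences is a list of token lists; predict_fn takes the preceding
    tokens and returns a ranked list of predicted words.
    """
    total = first_hits = topk_hits = 0
    for tokens in test_sentences:
        for i in range(1, len(tokens)):
            predictions = predict_fn(tokens[:i])[:k]
            actual = tokens[i]
            total += 1
            if predictions and predictions[0] == actual:
                first_hits += 1
            if actual in predictions:
                topk_hits += 1
    if total == 0:
        return 0.0, 0.0
    return first_hits / total, topk_hits / total


def stub_predictor(preceding_tokens):
    """Stand-in predictor that always returns the same ranked list."""
    return ["b", "c"]
```

For the two test sentences [a, b] and [a, c], the stub predictor gives a first-prediction match rate of 0.5 and a top-10 match rate of 1.0.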

The App

  • The App is easy to use.
    • The user can type or copy and paste a phrase into the text box.
    • The user can also select the number of predictions that the App should make.
  • The results are displayed in decreasing order of the word's probability of occurrence.
  • The App might take a few seconds at most to load initially.
  • The App also contains links to
    • Documentation
    • Twitter profile of the author
    • Milestone Report Rubric
    • And this slide deck

Please use the app at https://tarunkaushik.shinyapps.io/ShinyCapstone/

Thank you!