Data Science Capstone project

Nagesh Subrahmanyam
Oct 9th, 2016

Problem Statement

Predict the next word based on an input phrase

  • Applicable to smartphone keyboards, where it improves the typing experience
  • A keyboard application predicts the next word once the user has typed some text
  • Prediction is based on a model learned from an English-language corpus
  • The English-language corpus is available from this site: http://www.corpora.heliohost.org/
  • The corpus has three types of content: Twitter tweets, blog posts and news items

Machine Learning - 1

Develop a model to predict the next word based on user input

  • Human languages enforce a syntax (grammar) for both spoken and written text
  • The syntax helps determine the structure of a sentence
  • That is, once the first few words of a sentence are known, only a limited set of words can follow
  • A sequence of n consecutive words is known as an n-gram: n=2 is a bigram, n=3 is a trigram, etc.
  • Model development follows the same pattern, i.e. build n-grams from the corpus and use them to arrive at the most probable next word.
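
The idea can be sketched in a few lines of Python. This is only an illustration, not the project's code: the toy corpus and the helper name `next_word_counts` are assumptions made for the example. It counts which words follow a given prefix and picks the most frequent one.

```python
from collections import Counter

def next_word_counts(tokens, prefix):
    """Count the words that immediately follow `prefix` in `tokens`."""
    k = len(prefix)
    counts = Counter()
    for i in range(len(tokens) - k):
        if tokens[i:i + k] == prefix:
            counts[tokens[i + k]] += 1
    return counts

# Toy corpus, for illustration only (not taken from the real corpus).
tokens = "i love you so much and i love you too but i love you so".split()

counts = next_word_counts(tokens, ["love", "you"])
best_word, best_count = counts.most_common(1)[0]
print(best_word, best_count)
```

In this toy corpus "love you" is followed by "so" twice and by "too" once, so "so" is the predicted next word.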

Machine Learning - 2

Data acquisition and cleaning

  • All data is available as three English-language text files.
  • There is one file each for tweets, blogs and news.
  • These files are cleaned of excess white space, control characters and punctuation.
  • Stop words are not removed because everyday text uses them heavily.
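
A minimal Python sketch of such a cleaning step follows. The function name `clean_text` is hypothetical, and lower-casing is an assumption (the slides do not mention it, although the sample phrase is lower case). Stop words are deliberately left in place.

```python
import re
import string

def clean_text(line):
    """Lower-case a line and strip control characters, punctuation and extra spaces."""
    line = line.lower()  # assumption: the slides do not state that case is folded
    # Replace control characters (ASCII 0-31 and 127) with a space.
    line = re.sub(r"[\x00-\x1f\x7f]", " ", line)
    # Drop ASCII punctuation; stop words such as "the" or "you" are kept.
    line = line.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of white space and trim both ends.
    return re.sub(r"\s+", " ", line).strip()

print(clean_text("  I love\tyou,   SO much!!\n"))
```

Note that removing punctuation this way also removes apostrophes, so "don't" becomes "dont"; whether that is desirable depends on how the n-grams are later matched against user input.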

Tokenizing the text

  • With the cleaned data, we split the text into n-grams of a given size.
  • We start with n=2 (bigrams) and go up to n=5 (5-grams).
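
As a rough illustration, n-grams of each size can be produced with a sliding window over the word list. The helper name `ngrams` and the sample sentence are assumptions for this sketch.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i love you so much".split()
for n in range(2, 6):  # bigrams (n=2) up to 5-grams (n=5)
    print(n, ngrams(tokens, n))
```

A sentence shorter than n words simply yields an empty list for that n.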

Store and retrieve

  • The resulting n-gram counts are stored in an SQLite database, which a Shiny application queries for results.
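
The store-and-retrieve step might look like the following Python sketch using the standard `sqlite3` module. The table name `ngram_counts`, the schema and the helper `lookup` are assumptions; the column names and the three rows are borrowed from the project's sample output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, enough for a sketch
conn.execute(
    "CREATE TABLE ngram_counts ("
    "grams TEXT, gramNumWords INTEGER, totalCount INTEGER, "
    "word TEXT, wordCount INTEGER)"
)
rows = [
    ("i love you", 3, 2775, "so", 949),
    ("love you", 2, 2746, "too", 1535),
    ("you", 1, 288047, "are", 29052),
]
conn.executemany("INSERT INTO ngram_counts VALUES (?, ?, ?, ?, ?)", rows)
# An index on the lookup column keeps queries fast on a large table.
conn.execute("CREATE INDEX idx_grams ON ngram_counts (grams)")

def lookup(conn, grams):
    """Return (word, wordCount, totalCount) for the most frequent follower, or None."""
    cur = conn.execute(
        "SELECT word, wordCount, totalCount FROM ngram_counts "
        "WHERE grams = ? ORDER BY wordCount DESC LIMIT 1",
        (grams,),
    )
    return cur.fetchone()

print(lookup(conn, "love you"))
```

A phrase that is not in the table returns `None`, which lets the caller fall back to a shorter phrase.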

How does it work?

Sample walkthrough of an input phrase

  • Consider the input string "i love you".
  • Running the application on it gives this output:
grams        gramNumWords  totalCount  word  wordCount  score
i love you   3             2775        so    949        0.34198
love you     2             2746        too   1535       0.55899
you          1             288047      are   29052      0.10086
  • The trigram "i love you" occurred 2775 times, and "so" followed it 949 times.
  • Therefore, the score is 949 / 2775 = 0.341982.
  • If the trigram is not found, the bigram "love you" is tried next, and so on.
  • An n-gram that is not available is automatically left out of the list.
  • Lowering n step by step in this way implements the Stupid Backoff algorithm.
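
The backoff lookup can be sketched in Python as below. This is an illustration under stated assumptions, not the application's code: the dictionary `TABLE`, the function `backoff_candidates` and its parameters `max_n` and `alpha` are hypothetical. The counts come from the table above. `alpha` defaults to 1.0 so that the scores equal the raw ratios shown there; the published Stupid Backoff scheme discounts lower-order scores by a fixed factor (0.4 in the original paper).

```python
# grams -> (totalCount, most frequent next word, wordCount); rows from the table above.
TABLE = {
    "i love you": (2775, "so", 949),
    "love you": (2746, "too", 1535),
    "you": (288047, "are", 29052),
}

def backoff_candidates(phrase, table, max_n=4, alpha=1.0):
    """Try the longest available phrase first, then drop the leading word each time."""
    words = phrase.lower().split()[-max_n:]  # at most max_n words of context
    results = []
    penalty = 1.0
    while words:
        grams = " ".join(words)
        if grams in table:  # phrases that are not found are skipped
            total, word, count = table[grams]
            results.append((grams, word, round(penalty * count / total, 5)))
        words = words[1:]   # back off to a shorter phrase
        penalty *= alpha    # discount lower orders (no effect when alpha is 1.0)
    return results

for row in backoff_candidates("i love you", TABLE):
    print(row)
```

Calling it with a phrase whose longest form is missing, such as "we love you", simply starts from the first phrase that is found ("love you").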
