December 14, 2014

Overview

Natural Language Generation (NLG)

  • Subset of Natural Language Processing
  • Utilizes Grammars and Statistical Models Extracted from Human-Written Text

SwiftKey®

  • Uses NLG to Predict Next Word as User Types

Task

  • Extract and Build Datasets from Representative Text
  • Model that Text to Simulate SwiftKey’s Process
  • Recreate this Functionality in R

Analysis

Provided Datasets

  • 4 Languages - German, English, Finnish and Russian
    Focus is on English Text for this Project
  • Each further broken down into News, Blogs and Twitter Feeds

Descriptive Data Analysis

  • 100 Million Words
  • Varying Word Choices between Feeds
  • For instance, "Dimora" is seen only in the News Feed

Detailed Report: [https://rpubs.com/yxes/40858]

Preprocessing

Individual Word Extraction

  • Convert Line to Lower Case
  • Remove Characters other than A-Z, Spaces, and Apostrophes
  • Exception: Twitter Retains the # Symbol
  • Clean up Multiple Apostrophes or Hash Marks (#)
  • Remove Multiple Spaces
  • Split the Line into a Series of Words Separated by Spaces (see the sketch below)
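
A minimal R sketch of these cleaning steps (the function name clean_line and its arguments are illustrative, not the project's actual code):

    # Illustrative tokenizer following the steps above
    clean_line <- function(line, is_twitter = FALSE) {
      line <- tolower(line)                                   # lower case
      pattern <- if (is_twitter) "[^a-z '#]" else "[^a-z ']"  # keep a-z, space, apostrophe (and # for Twitter)
      line <- gsub(pattern, " ", line)
      line <- gsub("'{2,}", "'", line)                        # collapse repeated apostrophes
      line <- gsub("#{2,}", "#", line)                        # collapse repeated hash marks
      line <- gsub(" {2,}", " ", line)                        # collapse multiple spaces
      strsplit(trimws(line), " ")[[1]]                        # split into words
    }

    clean_line("It's a #GREAT   day!!")
    # [1] "it's"  "a"     "great" "day"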

Grams

  • Groups of Words (n-grams) Counted from Resulting Lines
  • Groups Consisting of 1-4 Words Counted and Ranked
  • Tables Created that Store the Highest-Count n-grams of Each Size (see the sketch below)
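
A small base-R sketch of the counting step; count_ngrams is an assumed name, and the project code may use a dedicated text-mining package instead:

    # Count and rank n-grams of size n from a vector of words
    count_ngrams <- function(words, n) {
      if (length(words) < n) return(table(character(0)))
      grams <- vapply(seq_len(length(words) - n + 1),
                      function(i) paste(words[i:(i + n - 1)], collapse = " "),
                      character(1))
      sort(table(grams), decreasing = TRUE)  # highest counts first
    }

    count_ngrams(c("to", "be", "or", "not", "to", "be"), 2)
    # "to be" occurs twice; "be or", "or not" and "not to" occur once each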

Prediction Algorithm

Processing

  • Based on Modified Markov Model
  • Locate Optimal Sequence of Tags for a Given Word Sequence
  • Last Four Words are Extracted from Input String
  • This Combination is Searched in the Quadgram Table
  • If Found, Return the Next Word in the List
  • If Not Found, Use the Last Three Words and Search the Trigram Table
  • Loop through Each Shorter Table until There Is a Match
  • If No Match Is Found, Return the Most Common English Word "the" (see the sketch below)
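
A hedged R sketch of this back-off search. It assumes the lookup tables are kept as a list of named character vectors mapping a context of 1-4 words to its most frequent next word; the project's actual tables may be structured differently:

    # Illustrative back-off predictor; the table layout is an assumption
    predict_next <- function(input, tables) {
      # simple tokenization for the sketch; the full pipeline would apply
      # the cleaning steps from the preprocessing slide
      words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
      for (n in 4:1) {                                  # quadgram down to unigram contexts
        if (length(words) < n) next
        context <- paste(tail(words, n), collapse = " ")
        hit <- tables[[n]][context]
        if (!is.na(hit)) return(unname(hit))            # found: return the predicted word
      }
      "the"                                             # no match: most common English word
    }

    tabs <- list(
      c("day" = "of"),                # 1-word contexts (toy examples)
      c("end of" = "the"),            # 2-word contexts
      c("the end of" = "the"),        # 3-word contexts
      c("at the end of" = "the")      # 4-word contexts
    )
    predict_next("We will meet at the end of", tabs)
    # [1] "the"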