December 14, 2014

Overview

Natural Language Generation (NLG)

  • Subset of Natural Language Processing
  • Utilizes Grammars and Statistical Models Extracted from Human-Written Text

SwiftKey®

  • Uses NLG to Predict Next Word as User Types

Task

  • Extract and Build Datasets from Representative Text
  • Model that Text to Simulate SwiftKey’s Process
  • Recreate this Functionality in R

Analysis

Provided Datasets

  • 4 Languages - German, English, Finnish and Russian
    Focus is on English Text for this Project
  • Each further broken down into News, Blogs and Twitter Feeds

Descriptive Data Analysis

  • 100 Million Words
  • Varying Word Choices between Feeds
  • For instance, "Dimora" is seen only in the News Feed

Detailed Report: [https://rpubs.com/yxes/40858]

Preprocessing

Individual Word Extraction

  • Convert Line to Lower Case
  • Remove Characters other than A-Z, Spaces, and Apostrophes
  • Exception: Twitter Retains the # Symbol
  • Clean up Multiple Apostrophes or Hash Marks (#)
  • Remove Multiple Spaces
  • Split the Line into a Series of Words Separated by Spaces (see the sketch below)
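
A minimal R sketch of these cleaning steps (the function name clean_line and its arguments are illustrative, not the project's actual code):

    # Illustrative tokenizer following the steps above
    clean_line <- function(line, is_twitter = FALSE) {
      line <- tolower(line)                                   # lower case
      pattern <- if (is_twitter) "[^a-z '#]" else "[^a-z ']"  # keep a-z, space, apostrophe (and # for Twitter)
      line <- gsub(pattern, " ", line)
      line <- gsub("'{2,}", "'", line)                        # collapse repeated apostrophes
      line <- gsub("#{2,}", "#", line)                        # collapse repeated hash marks
      line <- gsub(" {2,}", " ", line)                        # collapse multiple spaces
      strsplit(trimws(line), " ")[[1]]                        # split into words
    }

    clean_line("It's a #GREAT   day!!")
    # [1] "it's"  "a"     "great" "day"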

Grams

  • Groups of Words (n-grams) Counted from Resulting Lines
  • Groups Consisting of 1-4 Words Counted and Ranked
  • Tables Created that Store the Highest-Count n-grams of Each Size (see the sketch below)
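
A small base-R sketch of the counting step; count_ngrams is an assumed name, and the project code may use a dedicated text-mining package instead:

    # Count and rank n-grams of size n from a vector of words
    count_ngrams <- function(words, n) {
      if (length(words) < n) return(table(character(0)))
      grams <- vapply(seq_len(length(words) - n + 1),
                      function(i) paste(words[i:(i + n - 1)], collapse = " "),
                      character(1))
      sort(table(grams), decreasing = TRUE)  # highest counts first
    }

    count_ngrams(c("to", "be", "or", "not", "to", "be"), 2)
    # "to be" occurs twice; "be or", "or not" and "not to" occur once each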

Prediction Algorithm

Processing

  • Based on Modified Markov Model
  • Locate Optimal Sequence of Tags for a Given Word Sequence
  • Last Four Words are Extracted from Input String
  • This Combination is Searched in the Quadgram Table
  • If Found, Return the Next Word in the List
  • If Not Found, Use the Last Three Words and Search the Trigram Table
  • Loop through Each Shorter Table until There Is a Match
  • If No Match Is Found, Return the Most Common English Word "the" (see the sketch below)
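
A hedged R sketch of this back-off search. It assumes the lookup tables are kept as a list of named character vectors mapping a context of 1-4 words to its most frequent next word; the project's actual tables may be structured differently:

    # Illustrative back-off predictor; the table layout is an assumption
    predict_next <- function(input, tables) {
      # simple tokenization for the sketch; the full pipeline would apply
      # the cleaning steps from the preprocessing slide
      words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
      for (n in 4:1) {                                  # quadgram down to unigram contexts
        if (length(words) < n) next
        context <- paste(tail(words, n), collapse = " ")
        hit <- tables[[n]][context]
        if (!is.na(hit)) return(unname(hit))            # found: return the predicted word
      }
      "the"                                             # no match: most common English word
    }

    tabs <- list(
      c("day" = "of"),                # 1-word contexts (toy examples)
      c("end of" = "the"),            # 2-word contexts
      c("the end of" = "the"),        # 3-word contexts
      c("at the end of" = "the")      # 4-word contexts
    )
    predict_next("We will meet at the end of", tabs)
    # [1] "the"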