SmartType

Fred Smith
September 2016

SmartType is a predictive typing application that suggests the next word based on the words already typed.

This is a capstone project for the Coursera Data Science Specialization offered in collaboration with the Johns Hopkins Bloomberg School of Public Health.

SmartType Architecture

Development Environment

RStudio on Windows 10 with 16 GB of RAM
Data in text files, one document per line
- Blogs - 900K docs / 37M words
- News - 77K docs / 2.6M words
- Tweets - 2.3M docs / 30M words
Reduce.Rmd - Sample train and test data
Model.Rmd - Preprocess and model data
Evaluate.Rmd - Validate predictive performance

Deliverables

Shiny app with interactive and copy/paste UIs
Report on trained model vs. test data

Pre-Processing and Modeling

Pre-Processing Raw Data

Reduce the data to random samples because of processing limits (sample size is scalable)
- 10% of documents for training/modeling
- 1% of documents for testing/validation
Basic scrubbing
- Remove numbers
- Remove punctuation
- Convert to lower-case
- Remove extra white space
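
The sampling and scrubbing steps above can be sketched in base R. This is only an illustration, not the code in Reduce.Rmd or Model.Rmd: the function names `split_sample` and `scrub` are hypothetical, base `gsub` stands in for the tm transformations, and drawing the test sample so that it never overlaps the training sample is an assumption.

```r
# Hypothetical helper: draw disjoint random samples for training and testing.
split_sample <- function(docs, train_frac = 0.10, test_frac = 0.01, seed = 42) {
  set.seed(seed)                              # reproducible sampling
  idx <- sample.int(length(docs))             # random permutation of documents
  n_train <- floor(length(docs) * train_frac)
  n_test <- floor(length(docs) * test_frac)
  list(
    train = docs[idx[seq_len(n_train)]],
    test = docs[idx[n_train + seq_len(n_test)]]  # taken after the training slice
  )
}

# Hypothetical helper: the four basic scrubbing steps, in the order listed.
scrub <- function(docs) {
  docs <- gsub("[[:digit:]]+", "", docs)      # remove numbers
  docs <- gsub("[[:punct:]]+", "", docs)      # remove punctuation
  docs <- tolower(docs)                       # convert to lower case
  docs <- gsub("\\s+", " ", docs)             # collapse runs of white space
  trimws(docs)                                # drop leading/trailing space
}

raw <- c("Hello, World!  It's 2016.", "R is   FUN; try it 3 times")
clean <- scrub(raw)
print(clean)
```

Changing `train_frac` and `test_frac` is all it takes to scale the sample up when more memory is available.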

Modeling / Training

Uses tm and RWeka packages for analysis
One count vector per N-gram order (N in 1:4)
Each vector sorted by descending count
Elements named by their N-gram
Four vectors stored in one list, indexed by N
Store list/model in file for transfer
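
The model layout described above can be sketched as follows. This is a minimal base-R approximation, assuming scrubbed, space-separated input; the project itself uses tm and RWeka for tokenizing, and the names `ngram_counts` and `build_model` are hypothetical.

```r
# Hypothetical helper: count the n-grams of order n across all documents.
ngram_counts <- function(docs, n) {
  grams <- unlist(lapply(strsplit(docs, " ", fixed = TRUE), function(words) {
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))   # document too short
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- table(grams)
  # Named integer vector: names are the n-grams, sorted by descending count.
  sort(setNames(as.integer(counts), names(counts)), decreasing = TRUE)
}

# One vector per order, kept in a list indexed by N.
build_model <- function(docs, max_n = 4) {
  lapply(seq_len(max_n), function(n) ngram_counts(docs, n))
}

docs <- c("the cat sat down", "the cat ran off")
model <- build_model(docs)

# Store the list in a file so it can be moved to another machine.
path <- tempfile(fileext = ".rds")
saveRDS(model, path)
restored <- readRDS(path)
print(model[[2]])
```

Keeping the n-gram as the element name means a lookup needs only the `names()` of the vector, and the sort order already gives the ranking.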

Algorithm - predict(context,N,hint)

Input

context - string containing previous N words
N - number of words in context
hint - keystrokes starting the next word

Output

vector of suggestions for next word
ordered by decreasing frequency

Algorithm

Initialize model from file
Loop (k in 4:1) until at least 5 candidate next words are found
- Search k-gram vector by name
Filter results that begin with hint

Since predict() is called successively with the same context for multiple hints, k-gram search results are cached to improve performance.
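
A rough sketch of the back-off search, hint filter, and cache described above. It is not the app's actual code: the function is named `predict_next` here (a hypothetical name chosen to avoid masking `stats::predict`), the model is passed in rather than loaded from a file, the cache is a plain environment keyed by context, and the toy model is invented for illustration.

```r
# Toy model in the documented layout: named count vectors, indexed by N.
model <- list(
  c(the = 5L, cat = 3L, sat = 2L, dog = 2L, ran = 1L, cot = 1L, see = 1L),
  c("the cat" = 3L, "the dog" = 2L, "cat sat" = 2L, "the cot" = 1L, "cat ran" = 1L),
  c("the cat sat" = 2L, "the cat ran" = 1L),
  c("see the cat sat" = 1L)
)

cache <- new.env()   # hypothetical cache: context -> candidate words

# Back off from the longest k-gram to unigrams until enough candidates exist.
search_context <- function(model, context, N, min_terms = 5) {
  words <- strsplit(trimws(context), "\\s+")[[1]]
  words <- tail(words[nzchar(words)], N)
  found <- character(0)
  for (k in length(model):1) {
    if (k - 1 > length(words)) next             # not enough context for order k
    prefix <- if (k > 1) paste0(paste(tail(words, k - 1), collapse = " "), " ") else ""
    grams <- names(model[[k]])
    hits <- grams[startsWith(grams, prefix)]    # already in descending-count order
    found <- unique(c(found, sub(".* ", "", hits)))  # keep only the last word
    if (length(found) >= min_terms) break
  }
  found
}

predict_next <- function(model, context, N, hint = "") {
  key <- paste(N, context)
  if (!exists(key, envir = cache, inherits = FALSE)) {
    assign(key, search_context(model, context, N), envir = cache)
  }
  candidates <- get(key, envir = cache)         # reused for every later hint
  candidates[startsWith(candidates, hint)]
}

all_words <- predict_next(model, "the cat", 2)
with_hint <- predict_next(model, "the cat", 2, "s")
print(with_hint)
```

In this sketch, candidates from higher-order matches come first and each order keeps its descending-count ordering, so the second call only filters the cached vector.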

Application and Performance

Shiny App with Two UIs

Interactive (as on a cell phone)
- Buttons under input suggest up to 5 words
- Wordcloud shows larger set of suggestions
- Timings to separate algorithm time from network lag
Copy/Paste
- Copy/paste any text and submit
- Scans left to right
- Compares successive predictions against the actual next word
- Displays performance statistics

Performance Evaluation

evaluate(docs) - For a vector of documents
Collects and reports performance statistics

Performance Report Here

45% of words are among top 5 suggestions
24% of keystrokes could be saved
Each prediction takes about 300 ms (interactivity is hindered by network performance)
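
One plausible way to compute the two headline metrics is sketched below. The exact definitions used in Evaluate.Rmd are not given here, so this is an assumption: a word counts as a hit if it is among the top 5 suggestions before any letter is typed, and keystrokes are counted as saved from the point where the word first appears in the top 5. The `suggest` argument and the toy predictor are hypothetical stand-ins for the real model.

```r
# Hypothetical evaluator: suggest(context, hint) must return ranked suggestions.
evaluate <- function(docs, suggest, top = 5) {
  words_total <- 0; words_hit <- 0
  keys_total <- 0; keys_saved <- 0
  for (doc in docs) {
    words <- strsplit(trimws(doc), "\\s+")[[1]]
    words <- words[nzchar(words)]
    for (i in seq_along(words)) {               # scan left to right
      target <- words[i]
      context <- paste(words[seq_len(i - 1)], collapse = " ")
      words_total <- words_total + 1
      keys_total <- keys_total + nchar(target)
      for (typed in 0:(nchar(target) - 1)) {    # letters typed so far
        hint <- substr(target, 1, typed)
        if (target %in% head(suggest(context, hint), top)) {
          if (typed == 0) words_hit <- words_hit + 1
          keys_saved <- keys_saved + nchar(target) - typed
          break
        }
      }
    }
  }
  c(top_hit_rate = words_hit / words_total,
    keystrokes_saved = keys_saved / keys_total)
}

# Toy predictor that ignores context and ranks a fixed vocabulary.
toy_suggest <- function(context, hint) {
  vocab <- c("the", "cat", "sat", "on", "mat", "a")
  vocab[startsWith(vocab, hint)]
}

stats <- evaluate("the cat sat on the rug", toy_suggest)
print(round(stats, 2))
```

This count ignores the tap needed to accept a suggestion, so it is an upper bound on the keystrokes actually saved.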

Try SmartType Here