SmartType

Fred Smith
September 2016

SmartType is a predictive typing application that suggests the next word based on the preceding words.

This is a capstone project for the Coursera Data Science Specialization offered in collaboration with the Johns Hopkins Bloomberg School of Public Health.

SmartType Architecture

Development Environment

  • RStudio on Windows 10 with 16 GB of memory
  • Data in text files, one document per line
    • Blogs - 900K docs / 37M words
    • News - 77K docs / 2.6M words
    • Tweets - 2.3M docs / 30M words
  • Reduce.Rmd - Sample train and test data
  • Model.Rmd - Preprocess and model data
  • Evaluate.Rmd - Validate predictive performance

Deliverables

  • Shiny app with interactive and copy/paste UIs
  • Report on trained model vs. test data

Pre-Processing and Modeling

Pre-Processing Raw Data

  • Draw random samples to stay within processing limits (sample size is scalable)
    • 10% of documents for training/modeling
    • 1% of documents for testing/validation
  • Basic scrubbing
    • Remove numbers
    • Remove punctuation
    • Convert to lower-case
    • Remove extra white space
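
The sampling and scrubbing steps above can be sketched as follows. This is an illustration in Python, not the project's R code in Reduce.Rmd; the function names `sample_split` and `scrub` are hypothetical, and keeping the training and test samples disjoint is an assumption the slides do not state.

```python
import random
import re

def sample_split(docs, train_frac=0.10, test_frac=0.01, seed=0):
    """Shuffle the documents and take disjoint train/test slices.

    Disjoint samples are an assumption; the source only gives the fractions.
    """
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]

def scrub(text):
    """Apply the four scrubbing steps in the order listed."""
    text = re.sub(r"\d+", "", text)       # remove numbers
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation (keeps letters, underscores, spaces)
    text = text.lower()                   # convert to lower-case
    return re.sub(r"\s+", " ", text).strip()  # remove extra white space

print(scrub("It's 9 AM;  time to GO!"))  # its am time to go
```

Raising or lowering `train_frac` and `test_frac` is how the sample would scale with available memory.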

Modeling / Training

  • Uses tm and RWeka packages for analysis
  • One count vector per N-gram order (N in 1:4)
  • Each vector sorted by descending count
  • Each element named by its N-gram
  • Four vectors in one list, indexed by N
  • Store list/model in file for transfer
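
A minimal sketch of this model layout, written in Python rather than with tm and RWeka: one count table per N, sorted by descending count, keyed by the N-gram text, and collected in a single structure indexed by N. The names `build_model`, `save_model`, and `load_model` are hypothetical, and JSON is only one possible file format; the slides do not say which format the project uses.

```python
import json
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_model(docs, max_n=4):
    """Build {n: {ngram: count}} with each table sorted by descending count."""
    model = {}
    for n in range(1, max_n + 1):
        counts = Counter()
        for doc in docs:
            counts.update(ngrams(doc.split(), n))
        # most_common() sorts by count; a dict keeps that insertion order
        model[n] = dict(counts.most_common())
    return model

def save_model(model, path):
    """Store the whole model in one file for transfer."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(model, fh)

def load_model(path):
    """Load the model back; JSON turns the integer keys into strings."""
    with open(path, encoding="utf-8") as fh:
        raw = json.load(fh)
    return {int(n): table for n, table in raw.items()}

model = build_model(["the cat sat", "the cat ran"])
print(model[2])  # {'the cat': 2, 'cat sat': 1, 'cat ran': 1}
```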

Algorithm - predict(context,N,hint)

Input

  • context - string containing previous N words
  • N - number of words in context
  • hint - keystrokes starting the next word

Output

  • vector of suggestions for next word
  • ordered by decreasing frequency

Algorithm

  • Initialize model from file
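  • Loop from the longest N-gram down (k in 4:1) until at least 5 candidate next words are found
    • Search the k-gram vector for names that start with the last k-1 words of context
  • Keep only the candidates that begin with hint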

Since predict() is called successively with the same context for multiple hints, k-gram search results are cached to improve performance.
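
The back-off loop and the cache can be sketched as follows, in Python rather than the project's R. The toy `MODEL`, the helper `candidates`, and the use of `functools.lru_cache` are assumptions for illustration. Candidates from longer N-grams are listed first, which is one plausible reading of the ordering; the source does not say how results from different k are merged.

```python
from functools import lru_cache

# Hypothetical toy model: MODEL[k] maps each k-gram to its count,
# already sorted by decreasing count.
MODEL = {
    1: {"the": 5, "cat": 3, "sat": 2, "on": 2, "mat": 1, "can": 1},
    2: {"the cat": 3, "cat sat": 2, "sat on": 2, "on the": 2,
        "the mat": 1, "the can": 1},
    3: {"the cat sat": 2, "cat sat on": 2, "sat on the": 2,
        "on the cat": 1, "on the mat": 1},
    4: {"the cat sat on": 2, "cat sat on the": 2, "sat on the mat": 1},
}
MIN_CANDIDATES = 5

@lru_cache(maxsize=1024)
def candidates(context):
    """Back off from 4-grams to 1-grams, collecting next words for context.

    Cached per context, so successive hints reuse one search.
    """
    words = context.split()
    found = []
    for k in range(4, 0, -1):
        if k - 1 > len(words):
            continue  # not enough context words for this order
        prefix = " ".join(words[len(words) - (k - 1):])
        for gram in MODEL[k]:
            head, _, last = gram.rpartition(" ")
            if head == prefix and last not in found:
                found.append(last)
        if len(found) >= MIN_CANDIDATES:
            break
    return tuple(found)

def predict(context, n, hint=""):
    """Suggest next words for the last n words of context, filtered by hint."""
    context = " ".join(context.split()[-n:]) if n > 0 else ""
    return [w for w in candidates(context) if w.startswith(hint)]

print(predict("sat on the", 3, "ca"))  # ['cat', 'can']
```

Because the hint filter runs after the search, typing another letter only re-filters the cached candidates and does not touch the N-gram tables again.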

Application and Performance

Shiny App with Two UIs

  • Interactive (as on a cell phone)
    • Buttons under input suggest up to 5 words
    • Wordcloud shows larger set of suggestions
    • Timings shown to separate algorithm time from network lag
  • Copy/Paste
    • Copy/paste any text and submit
    • Scans left to right
    • Compares each successive prediction with the actual next word
    • Displays performance statistics

Performance Evaluation

  • evaluate(docs) - For a vector of documents
  • Collects and reports performance statistics

Performance Report

  • 45% of words are among top 5 suggestions
  • 24% of keystrokes could be saved
  • Each prediction takes about 300 ms (interactivity is hindered by network performance)
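
One way the first two statistics could be computed is sketched below in Python. The definitions are assumptions, since the slides do not spell them out: a word counts as a hit if it is among the top 5 suggestions before any letter is typed, and the keystrokes saved for a word are the letters left untyped once it first appears among the top 5. The `suggest` stub and `VOCAB` list are hypothetical stand-ins for the trained predictor, and unlike the `evaluate(docs)` named in the slides, this version takes the predictor as an argument so the example is self-contained.

```python
VOCAB = ["the", "cat", "sat", "on", "mat", "can"]  # hypothetical, by decreasing frequency

def suggest(context, hint):
    """Stub predictor: ignores context and ranks words by frequency only."""
    return [w for w in VOCAB if w.startswith(hint)]

def evaluate(docs, suggest, top=5, order=3):
    """Scan each document left to right and score every next-word prediction."""
    words = hits = keys_total = keys_saved = 0
    for doc in docs:
        tokens = doc.split()
        for i, target in enumerate(tokens):
            context = " ".join(tokens[max(0, i - order):i])
            words += 1
            keys_total += len(target)
            # Type the word one letter at a time until it is suggested.
            for typed in range(len(target) + 1):
                if target in suggest(context, target[:typed])[:top]:
                    if typed == 0:
                        hits += 1  # suggested before any keystroke
                    keys_saved += len(target) - typed
                    break
    return {
        "top_hit_rate": hits / words if words else 0.0,
        "keystrokes_saved": keys_saved / keys_total if keys_total else 0.0,
    }

stats = evaluate(["the cat sat on the can"], suggest)
print(stats)
```

On this toy document, five of the six words are suggested immediately and "can" appears after one letter, so the hit rate is 5/6 and 16 of 17 letters are saved.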
