Page 1: Overview and Motivation

  • This project builds a next-word prediction app using the SwiftKey dataset (blogs and news).
  • The goal is to create a lightweight, fast, and reasonably accurate prediction model using NLP tools in R.
  • Due to memory/runtime limitations:
    • Only 50% of the blogs and news data was used (a sampling sketch follows this list).
    • Twitter data was excluded: it is too noisy and informal, and it interfered with clean predictions.
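
  A minimal sketch of what this subsetting might look like in R (not the submitted code; the file paths, seed, and sampling call below are illustrative assumptions):

      # Sample 50% of the blog and news lines; skip the Twitter file entirely.
      set.seed(42)                                   # arbitrary seed, for reproducibility
      sample_half <- function(path) {
        lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
        lines[sample(length(lines), size = floor(0.5 * length(lines)))]
      }
      blogs <- sample_half("final/en_US/en_US.blogs.txt")
      news  <- sample_half("final/en_US/en_US.news.txt")
      corpus_text <- c(blogs, news)                  # en_US.twitter.txt deliberately excluded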

Page 2: Prediction Algorithm

  • The model is based on an n-gram backoff algorithm:
    • Uses trigrams → bigrams → unigrams, backing off when higher-order matches are unavailable (see the sketch after this page's bullets).
    • No smoothing beyond backoff logic.
  • Key Features:
    • Text is cleaned: punctuation, numbers, and stopwords are removed.
    • Tokenization is performed using tidytext.
    • Backoff is fast and interpretable.
  • Despite its simplicity, the model achieves a perplexity of 35, indicating strong predictive performance on clean input.
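
  A rough sketch of the cleaning, tokenization, and backoff steps described above, assuming the corpus_text vector from Page 1's sketch; the dplyr/tidyr pipeline, the top-3 cutoff, and the omission of stopword removal are illustrative choices, not the submitted code:

      library(dplyr)
      library(tidyr)
      library(tidytext)
      library(stringr)

      # Clean: lowercase, strip numbers and punctuation (stopword removal, as described
      # above, could be added with tidytext::stop_words; omitted here for brevity).
      clean <- tibble(text = corpus_text) %>%
        mutate(text = str_to_lower(text),
               text = str_remove_all(text, "[0-9]+"),
               text = str_remove_all(text, "[[:punct:]]"))

      # Count n-grams with tidytext; count() sorts so the most frequent come first.
      count_ngrams <- function(df, n) {
        df %>%
          unnest_tokens(ngram, text, token = "ngrams", n = n) %>%
          filter(!is.na(ngram)) %>%
          count(ngram, sort = TRUE)
      }
      tri <- count_ngrams(clean, 3) %>% separate(ngram, c("w1", "w2", "w3"), sep = " ")
      bi  <- count_ngrams(clean, 2) %>% separate(ngram, c("w1", "w2"), sep = " ")
      uni <- count_ngrams(clean, 1)

      # Backoff: try trigrams, then bigrams, then fall back to the most frequent words.
      predict_next <- function(phrase, k = 3) {
        toks <- str_split(str_to_lower(phrase), "\\s+")[[1]]
        len  <- length(toks)
        if (len >= 2) {
          hit <- tri %>% filter(w1 == toks[len - 1], w2 == toks[len]) %>% head(k)
          if (nrow(hit) > 0) return(list(tier = "trigram", words = hit$w3))
        }
        if (len >= 1) {
          hit <- bi %>% filter(w1 == toks[len]) %>% head(k)
          if (nrow(hit) > 0) return(list(tier = "bigram", words = hit$w2))
        }
        list(tier = "unigram", words = head(uni$ngram, k))
      }

      predict_next("The sun is")   # returns the matched tier plus up to k candidate words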

Page 3: App Overview

  • The Shiny app takes a user-entered phrase and returns the predicted next word(s); a minimal UI sketch follows the instructions below.
  • Features:
    • Real-time predictions with a simple UI.
    • Displays the matched n-gram tier (tri-, bi-, or unigram).
  • Instructions:
    1. Type in a phrase like “The sun is”.
    2. Press “Predict” — the app suggests next word(s).
    3. You can repeat and build phrases interactively.
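
  A minimal Shiny skeleton consistent with this description; predict_next() is the hypothetical backoff helper sketched on Page 2, and the widget labels are illustrative:

      library(shiny)

      ui <- fluidPage(
        titlePanel("Next-Word Prediction"),
        textInput("phrase", "Enter a phrase:", value = "The sun is"),
        actionButton("go", "Predict"),
        verbatimTextOutput("result")
      )

      server <- function(input, output) {
        # Recompute only when the Predict button is pressed.
        prediction <- eventReactive(input$go, {
          predict_next(input$phrase)
        })
        output$result <- renderText({
          p <- prediction()
          paste0("Matched tier: ", p$tier, "\n",
                 "Suggested word(s): ", paste(p$words, collapse = ", "))
        })
      }

      shinyApp(ui, server)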

Page 4: Design Tradeoffs & Limitations

  • Runtime tradeoff: Only half the corpus used for faster loading.
  • Model simplicity: No deep learning; just n-grams and backoff.
  • Introduced noise: some random n-grams were added to test generalization; this slightly hindered top-1 accuracy but improved coverage.
  • Exclusion of Twitter: Intentional, as its informal language hurt precision.
  • Despite these tradeoffs:
    • The app still achieves high accuracy.
    • The lightweight, low-perplexity design is suitable for mobile or embedded use (a perplexity sketch follows this list).
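
  For reference, perplexity is the exponentiated average negative log-probability assigned to held-out words, so lower is better. A sketch of how the figure of 35 reported on Page 2 might be checked, assuming a hypothetical next_word_prob(context, word) helper and a small probability floor (needed because plain backoff without smoothing can assign zero probability):

      # Perplexity over held-out (context, next word) pairs.
      perplexity <- function(contexts, words, floor = 1e-6) {
        probs <- mapply(function(ctx, w) max(next_word_prob(ctx, w), floor),
                        contexts, words)
        exp(-mean(log(probs)))
      }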

Page 5: Reflections and Evaluation

  • The app is simple, fast, and robust.
  • The algorithm, though basic, delivers high predictive power.
  • User experience is smooth — clean interface, responsive predictions.
  • Novelty:
    • Purposefully excluded noisy data and added adversarial n-grams.
    • Achieved balance between speed and accuracy using classic methods.

I would definitely hire this person to build scalable, reliable NLP prototypes at my data science startup.