2025-05-18

Executive Summary

This project aims to develop a next-word prediction model based on large-scale English text datasets. Using n-gram language modeling, the application predicts the most likely next word given a user’s input phrase. The model leverages data from blogs, news articles, and Twitter feeds to capture diverse language patterns.

Due to resource constraints, the current implementation focuses on 2- to 4-gram models, balancing prediction accuracy with computational efficiency and deployment limitations. The final product is a Shiny web application that allows users to input phrases and receive real-time next-word predictions.

This project showcases natural language processing techniques, data cleaning, and predictive modeling, culminating in an interactive app designed for usability and scalability. Future work includes expanding the n-gram range and improving the prediction algorithm for enhanced performance.

Data Management

Data Acquisition

SwiftKey, a company providing text prediction technology, made the dataset for this capstone available. The dataset consists of text from Twitter, blogs, and news sources in four different languages. I focused on the three English datasets for this capstone.

Data Cleaning

• Convert to lowercase

• Remove punctuation and numbers

• Strip whitespace

• Remove stopwords and profanity
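
The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name clean_text and the tiny STOPWORDS and PROFANITY sets are hypothetical placeholders, and a real run would load fuller word lists.

```python
import re

# Hypothetical, deliberately tiny word lists used only for illustration.
STOPWORDS = {"a", "an", "the", "of", "to", "and"}
PROFANITY = {"darn", "heck"}


def clean_text(text: str) -> list[str]:
    """Apply the four cleaning steps and return the remaining tokens."""
    text = text.lower()                      # convert to lowercase
    text = re.sub(r"['’]", "", text)         # drop apostrophes so "don't" becomes "dont"
    text = re.sub(r"[^a-z\s]", " ", text)    # remove remaining punctuation and numbers
    tokens = text.split()                    # strip and collapse whitespace
    # remove stopwords and profanity
    return [t for t in tokens if t not in STOPWORDS and t not in PROFANITY]


cleaned = clean_text("The 2 Quick, brown foxes don't say DARN!")
print(cleaned)
```

The same function would be applied both to the training text and to the phrase the user types, so that the lookup compares like with like.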

N-Gram Modeling

Table: Word count summary per line

File    Mean  SD  Median  IQR   Max  Min
blogs     42  46      28   50  6630    1
news      34  23      31   26  1031    1
tweets    13   7      12   11    47    1

The algorithm proceeds as follows:

• The user types a number of words (a string)

• The string is cleaned

• If the string matches a row in the token column of the n-gram model, a prediction is generated

• If the string is not matched, the next lower-order n-gram model is tried for a match
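
The lookup steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names build_ngram_tables and predict_next, the dictionary layout, and the toy corpus are all hypothetical, and the input is assumed to be already cleaned and tokenized.

```python
from collections import Counter, defaultdict


def build_ngram_tables(sentences, max_n=4):
    """Map each (n-1)-word prefix to a Counter of the words that followed it."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for tokens in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                prefix = " ".join(tokens[i:i + n - 1])
                tables[n][prefix][tokens[i + n - 1]] += 1
    return tables


def predict_next(tables, cleaned_tokens, max_n=4):
    """Try the longest usable prefix first, then fall back to shorter ones."""
    for n in range(max_n, 1, -1):
        if len(cleaned_tokens) < n - 1:
            continue  # not enough words typed for this order
        prefix = " ".join(cleaned_tokens[-(n - 1):])
        followers = tables[n].get(prefix)
        if followers:
            return followers.most_common(1)[0][0]
    return None  # no match at any order


# Toy corpus (already cleaned and tokenized), for illustration only.
corpus = [
    ["thanks", "for", "the", "follow"],
    ["thanks", "for", "the", "help"],
    ["thanks", "for", "the", "follow"],
    ["looking", "forward", "to", "it"],
]
tables = build_ngram_tables(corpus)

print(predict_next(tables, ["thanks", "for", "the"]))  # matched in the 4-gram table
print(predict_next(tables, ["wait", "for", "the"]))    # falls back to the 3-gram table
print(predict_next(tables, ["forward"]))               # only the 2-gram table applies
```

In this sketch the second call finds no 4-gram match for "wait for the", so it retries with the shorter prefix "for the" in the 3-gram table.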

Backoff Model

In this model:

• When a user enters a phrase, the model searches the highest-order n-gram table first and then each lower-order table in turn (from n = 4 down to n = 2).

• The model places a higher weight on matches found in higher-order n-grams, because a longer phrase holds more predictive power than a shorter one.

• I originally intended to use 2- to 6-gram models, but the shinyapps.io size limit restricted me to 2- to 4-grams, which saves space at the expense of accuracy.
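
One way to give higher-order matches more weight is to combine the relative frequencies from every matching order, each multiplied by a fixed weight. The sketch below is an assumption about how such weighting could work, not the scheme actually used in the app: the weights in ORDER_WEIGHTS, the function name score_candidates, and the toy counts are all hypothetical.

```python
from collections import Counter

# Hypothetical weights: matches on longer prefixes count for more.
ORDER_WEIGHTS = {4: 0.6, 3: 0.3, 2: 0.1}


def score_candidates(tables, tokens, weights=ORDER_WEIGHTS):
    """Sum weighted relative frequencies of each candidate across all orders."""
    scores = Counter()
    for n, weight in weights.items():
        if len(tokens) < n - 1:
            continue  # not enough words typed for this order
        prefix = " ".join(tokens[-(n - 1):])
        followers = tables.get(n, {}).get(prefix)
        if not followers:
            continue  # no match at this order
        total = sum(followers.values())
        for word, count in followers.items():
            scores[word] += weight * count / total
    return scores


# Toy tables in the form {n: {prefix: Counter(next_word -> count)}}.
toy_tables = {
    4: {"thanks for the": Counter({"follow": 2, "help": 1})},
    3: {"for the": Counter({"first": 5, "follow": 2, "help": 1})},
    2: {"the": Counter({"first": 20, "same": 15, "follow": 2})},
}

ranked = score_candidates(toy_tables, ["thanks", "for", "the"]).most_common(3)
print([word for word, _ in ranked])
```

With these toy counts the 2-gram table alone would favor "first", but the heavier weight on the 4-gram match moves "follow" to the top of the ranking.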

Conclusion

This document summarizes the initial stages of the Capstone Project. The next steps involve building a Shiny app for word prediction and deploying it online.

Application is found here