Capstone Final Project

Divya Netam

2026-06-25

Introduction

The objective of this project is to develop a predictive text application capable of suggesting the next word a user is likely to type.

Data Sources:

  1. en_US.blogs.txt
  2. en_US.news.txt
  3. en_US.twitter.txt

The purpose of this analysis is to understand the size, structure, and characteristics of the data and to identify an appropriate strategy for building an efficient next-word prediction model.

Data Exploration

Key findings:

  1. Twitter messages are short and informal.
  2. Blog posts are longer and more varied.
  3. News articles contain more formal language.

Supporting outputs:

  1. Summary table (file size, line count, word count)
  2. Histogram of words per line

Conclusion: The combined corpus provides a diverse representation of everyday English.
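The summary statistics named above (file size, line count, word count, words per line) could be computed along the following lines. This is a minimal Python sketch for illustration, not the code used in the project. The function names summarize_lines and summarize_file are hypothetical, and splitting on whitespace is an assumed tokenisation rule.

```python
import os
from collections import Counter


def summarize_lines(lines):
    """Summarise an iterable of text lines (assumes whitespace-separated words)."""
    counts = [len(line.split()) for line in lines]
    return {
        "lines": len(counts),
        "words": sum(counts),
        "mean_words_per_line": sum(counts) / len(counts) if counts else 0.0,
        # Maps "words in a line" -> "number of lines", i.e. the histogram data.
        "histogram": Counter(counts),
    }


def summarize_file(path):
    """Summarise one corpus file and attach its size on disk in megabytes."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        summary = summarize_lines(handle)
    summary["size_mb"] = os.path.getsize(path) / (1024 ** 2)
    return summary


# Tiny made-up sample: one tweet-like line and one news-like line.
sample = ["so happy today", "The committee approved the annual budget on Tuesday"]
summary = summarize_lines(sample)
print(summary["lines"], summary["words"], summary["mean_words_per_line"])
```

Calling summarize_file on each of the three data files would give one row of the summary table per file, and the histogram counter can be passed to any plotting tool.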

N-Gram Analysis

Text was cleaned and analyzed using N-grams:

  1. Unigrams (single words)
  2. Bigrams (two-word combinations)
  3. Trigrams (three-word combinations)
  4. 4-grams (four-word combinations), which the prediction step consults first
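As an illustration of the counting step, the sketch below tallies N-grams of a chosen length from a handful of lines. It is written in Python for brevity and is not the project's code. The cleaning rule (lowercasing and keeping only letters and apostrophes) and the names tokenize and count_ngrams are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes (assumed cleaning rule)."""
    return re.findall(r"[a-z']+", text.lower())


def count_ngrams(lines, n):
    """Count every run of n consecutive tokens, never crossing a line boundary."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Made-up toy corpus for demonstration only.
corpus = ["I love New York", "I love coffee", "New York is big"]
unigrams = count_ngrams(corpus, 1)
bigrams = count_ngrams(corpus, 2)
trigrams = count_ngrams(corpus, 3)
print(bigrams.most_common(2))
```

Sorting each table by count gives the most frequent N-grams, which is the information a frequency plot for this step would show.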

Prediction Algorithm

The application uses an N-gram language model with a backoff strategy.

Process:

  1. User enters text.
  2. The model looks up the last three words of the input in the 4-gram table.
  3. If no match is found, it backs off to trigrams, then bigrams, then unigrams, shortening the context by one word at each step.
  4. The most frequent matching next word is returned.
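The backoff process above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the application's actual implementation: the names build_tables and predict_next are hypothetical, the tokeniser is an assumed cleaning rule, and the model simply picks the highest raw count at the longest matching context, with no smoothing or discounting.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes (assumed cleaning rule)."""
    return re.findall(r"[a-z']+", text.lower())


def build_tables(lines, max_n=4):
    """For n = 1..max_n, map each (n-1)-word context to a Counter of following words."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for line in lines:
        tokens = tokenize(line)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables


def predict_next(text, tables, max_n=4):
    """Try the longest context first, then back off one word at a time."""
    tokens = tokenize(text)
    for n in range(max_n, 0, -1):
        if n - 1 > len(tokens):
            continue  # Input is too short for this N-gram order.
        context = tuple(tokens[len(tokens) - (n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # Only reached when the tables are empty.


# Made-up toy corpus for demonstration only.
corpus = [
    "thanks for the follow",
    "thanks for the follow",
    "thanks for the help",
    "see you at the end",
    "the end",
    "the end",
]
tables = build_tables(corpus)
print(predict_next("Thanks for the", tables))  # matched in the 4-gram table
print(predict_next("look at the", tables))     # backs off to the trigram table
print(predict_next("xyzzy", tables))           # backs off to the most frequent unigram
```

Keeping one table per N-gram order makes each lookup a single dictionary access, which is one way to keep predictions fast enough for interactive use.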

Shiny Application

Interactive Text Prediction App

Features:

  1. User enters a phrase in a text box.
  2. Application predicts the next word instantly.
  3. Simple and user-friendly interface.

Future Improvements:

  1. Multiple word suggestions
  2. Improved prediction accuracy
  3. Enhanced user interface

Outcome: A functional predictive text application built from real-world language data.

http://kookie01.shinyapps.io/Shiny/