Data Science Milestone Report

Anjali Jalhotra

2026-05-05

Introduction

This report documents the exploratory stage of developing a Context-Aware Next-Word Prediction Engine.

The objective of the project is to design an algorithm capable of learning linguistic patterns from large-scale text data and predicting the most likely next word a user will type.


Data Summary & Statistics

Dataset Overview

data <- data.frame(
  Source = c("Customer Support", "Documentation", "Journaling"),
  Line_Count = c("~1.8M", "~950K", "~720K"),
  Word_Count = c("~28M", "~31M", "~35M"),
  Characteristics = c(
    "Short, task-oriented",
    "Structured, repetitive",
    "Narrative, expressive"
  )
)

library(DT)  # provides datatable()

# Render the summary as an interactive, pageable table
datatable(data,
          options = list(pageLength = 5),
          rownames = FALSE)

Interactive Dataset Visualization

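library(plotly)  # provides plot_ly() and layout()

# Approximate word counts per source, in millions (same figures as the table)
plot_ly(
  x = c("Customer Support", "Documentation", "Journaling"),
  y = c(28, 31, 35),
  type = "bar"
) %>%
  layout(yaxis = list(title = "Words (millions, approx.)"))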

Exploratory Analysis Findings

Key Findings

High-Frequency Words

A small core of common function words (articles, prepositions, pronouns) dominates usage across all three datasets.
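
One way to check this kind of finding is to count unigrams and look at how much of the running text the top few words cover. The sketch below uses base R on a tiny made-up corpus; the sample sentences and the names `corpus`, `unigram_counts`, and `coverage` are illustrative assumptions, not the project's actual data or code.

```r
# Toy corpus standing in for the real data (hypothetical example text).
corpus <- c(
  "the cat sat on the mat",
  "the dog sat on the log",
  "a cat and a dog"
)

# Lowercase and split on whitespace to get one flat vector of tokens.
tokens <- unlist(strsplit(tolower(corpus), "\\s+"))

# Count each word and sort from most to least frequent.
unigram_counts <- sort(table(tokens), decreasing = TRUE)

# Cumulative share of all tokens covered by the top-k words.
coverage <- cumsum(unigram_counts) / sum(unigram_counts)

head(unigram_counts, 3)
head(round(coverage, 2), 3)
```

On a real corpus the coverage curve would be expected to rise steeply at first, which is what makes a small high-frequency vocabulary useful as a fallback.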

Contextual Predictability

Two- and three-word sequences (bigrams and trigrams) provide strong predictive signals for the next word.
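
As a rough illustration of how such sequences can be counted, the base R sketch below builds bigrams and trigrams per line so that no n-gram spans two lines. The helper `make_ngrams` and the two sample sentences are hypothetical and only meant to show the idea.

```r
# Build all n-grams (as space-joined strings) from one token vector.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical sample lines; each line is tokenized separately.
sentences  <- c("thank you for your help", "thank you for the update")
token_list <- strsplit(tolower(sentences), "\\s+")

bigrams  <- unlist(lapply(token_list, make_ngrams, n = 2))
trigrams <- unlist(lapply(token_list, make_ngrams, n = 3))

# Frequency tables, most frequent first.
bigram_counts  <- sort(table(bigrams), decreasing = TRUE)
trigram_counts <- sort(table(trigrams), decreasing = TRUE)
```

Here "thank you" and "thank you for" each occur twice, so they would rank at the top of their tables.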

Data Pruning Opportunity

Removing extremely rare words can reduce model size while maintaining prediction accuracy.
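
The trade-off can be sketched by comparing how much of the vocabulary is dropped against how much of the running text is still covered. The counts and the `min_count` threshold below are made-up assumptions for illustration; a real threshold would need to be tuned against held-out prediction accuracy.

```r
# Hypothetical unigram counts; in practice these would come from the corpus.
word_counts <- c(the = 120, you = 85, help = 40, update = 3,
                 zyzzyva = 1, qwertz = 1)

# Assumed threshold: keep words seen at least this many times.
min_count <- 2

kept <- word_counts[word_counts >= min_count]

# Share of vocabulary entries kept vs. share of running text still covered.
vocab_kept    <- length(kept) / length(word_counts)
coverage_kept <- sum(kept) / sum(word_counts)
```

In this toy case a third of the vocabulary is removed while 248 of 250 tokens remain covered.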


Prediction Algorithm

Back-Off Strategy

Primary

Predict using trigrams when sufficient context exists.

Secondary

Back off to bigrams when trigram matches are unavailable.

Fallback

Suggest common unigrams when context is limited.
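
The three tiers above can be sketched as a single lookup function. This is a minimal base R illustration under stated assumptions: the lookup tables `trigram_next`, `bigram_next`, and `top_unigrams` are tiny hypothetical stand-ins, each context maps to only one candidate word, and no smoothing or probability weighting is applied.

```r
# Hypothetical lookup tables (a real model would be built from n-gram counts).
trigram_next <- c("thank you" = "for", "you for" = "your")  # two-word context
bigram_next  <- c("thank" = "you", "for" = "the")           # one-word context
top_unigrams <- c("the", "to", "and")                       # most common words

predict_next <- function(text) {
  # Normalize and tokenize the input on whitespace.
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)

  # Primary: use the last two words as trigram context.
  if (n >= 2) {
    key <- paste(words[(n - 1):n], collapse = " ")
    if (key %in% names(trigram_next)) return(trigram_next[[key]])
  }

  # Secondary: back off to the last word as bigram context.
  if (n >= 1) {
    key <- words[n]
    if (key %in% names(bigram_next)) return(bigram_next[[key]])
  }

  # Fallback: suggest the most common unigram.
  top_unigrams[1]
}

predict_next("Thank you")    # trigram match
predict_next("waiting for")  # no trigram match, bigram match on "for"
predict_next("hello world")  # no match, unigram fallback
```

Each tier is tried only when the more specific one finds no match, so longer context is preferred whenever it is available.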


Interactive Next-Word Demo

Type a sentence below:

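textInput("usertext", "Enter Text:", "")
renderText({

  input_text <- input$usertext

  # Show a hint until the user has typed something other than whitespace.
  if (is.null(input_text) || !nzchar(trimws(input_text))) {
    return("Prediction will appear here...")
  }

  # Placeholder output: echoes the input text rather than a model prediction.
  paste("Predicted next word for:", input_text)
})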

Shiny Application Goals

  • Fast prediction response
  • Live typing simulation
  • User-friendly text interface
  • Top-word suggestions
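
The top-word suggestions goal could be met by ranking candidate next words by count and returning the first few. The sketch below is one possible shape in base R; the `bigram_table` data frame, its column names, its counts, and the `suggest_top` helper are all hypothetical.

```r
# Hypothetical bigram counts: context word, candidate next word, count.
bigram_table <- data.frame(
  context   = c("thank", "thank", "thank", "for", "for"),
  next_word = c("you", "goodness", "them", "the", "your"),
  count     = c(50, 4, 9, 30, 22),
  stringsAsFactors = FALSE
)

# Return up to k candidate next words for a context, most frequent first.
suggest_top <- function(context_word, k = 3) {
  rows <- bigram_table[bigram_table$context == context_word, ]
  rows <- rows[order(-rows$count), ]
  head(rows$next_word, k)
}

suggest_top("thank", 2)
```

An unseen context returns an empty character vector, which the app could treat as the signal to fall back to common unigrams.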

Conclusion

This milestone confirms that the data has been successfully processed and that exploratory analysis revealed meaningful linguistic patterns.

The back-off prediction strategy provides a strong foundation for the final prediction engine and Shiny deployment.