Milestone Report: Next Word Prediction

Author

Chaithra Rai

1. Introduction

The goal of this project is to build a predictive text model that can suggest the next word based on a given phrase. This report summarizes the exploratory analysis of the dataset and outlines the approach for building the prediction algorithm and Shiny application.


2. Data Overview

The analysis uses three text sources:

  • Blogs

  • News

  • Twitter

These datasets contain natural language text and will be used to train a next-word prediction model.


3. Data Loading


# Load required library
library(stringr)

# Sample data (replace with actual dataset if available)
blogs <- c("This is a sample blog text", "Blogs contain long form content")
news <- c("Breaking news is important", "News articles are informative")
twitter <- c("I love coding", "Twitter has short messages")
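If the full corpus is available as plain-text files, the sample vectors above could be replaced by reading each file line by line. The sketch below is only an assumption about how that might look: the helper name `read_corpus` is hypothetical, and a temporary file stands in for a real data file so that the example runs on its own.

```r
# Hypothetical helper: read a text file into a character vector,
# one element per line.
read_corpus <- function(path) {
  con <- file(path, open = "r", encoding = "UTF-8")
  on.exit(close(con))  # always release the connection
  readLines(con, warn = FALSE, skipNul = TRUE)
}

# A temporary file stands in for a real source file (assumption for illustration).
tmp <- tempfile(fileext = ".txt")
writeLines(c("This is a sample blog text",
             "Blogs contain long form content"), tmp)

blogs_from_file <- read_corpus(tmp)
length(blogs_from_file)
```

Reading through an explicit connection lets the encoding be stated once, and `skipNul = TRUE` skips embedded NUL characters that sometimes appear in scraped text.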

4. Basic Statistics

line_counts <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter))
)

line_counts

5. Word Counts

word_count <- function(text) {
  sum(str_count(text, "\\S+"))
}

word_counts <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Words = c(word_count(blogs), word_count(news), word_count(twitter))
)

word_counts

6. Data Exploration

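# Combine all sources, lowercase the text, and split on whitespace
all_text <- c(blogs, news, twitter)
words <- unlist(str_split(tolower(all_text), "\\s+"))
# Drop empty tokens so they are not counted as zero-length words
words <- words[words != ""]
word_lengths <- nchar(words)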

hist(word_lengths,
     main = "Word Length Distribution",
     xlab = "Word Length",
     col = "lightblue",
     border = "white")

7. Key Findings

  • Twitter data contains shorter sentences than Blogs and News

  • Blogs contain longer and more descriptive text

  • Most words are between 3–7 characters in length

  • The dataset includes both formal and informal language
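One way to check the first two findings is to compare the mean number of words per line across sources. The sketch below uses base R only and repeats the sample data so that it runs on its own; the helper name `mean_words_per_line` is an assumption introduced for illustration, and the six sample lines are far too few to support a conclusion about the real data.

```r
# Sample data repeated from the Data Loading section.
blogs <- c("This is a sample blog text", "Blogs contain long form content")
news <- c("Breaking news is important", "News articles are informative")
twitter <- c("I love coding", "Twitter has short messages")

# Hypothetical helper: mean number of whitespace-separated tokens per line.
mean_words_per_line <- function(text) {
  mean(lengths(strsplit(trimws(text), "\\s+")))
}

line_length_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  MeanWordsPerLine = c(mean_words_per_line(blogs),
                       mean_words_per_line(news),
                       mean_words_per_line(twitter))
)

line_length_summary
```

On the sample lines this gives 5.5 words per line for Blogs, 4 for News, and 3.5 for Twitter.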

8. Plan for Prediction Algorithm

The prediction model will use an N-gram approach:

  • Bigram model (uses 1 previous word)

  • Trigram model (uses 2 previous words)

  • Backoff strategy:

      • Try the trigram model first

      • If no match is found, fall back to the bigram model

Because a prediction only requires looking up precomputed n-gram counts, this approach is efficient and suitable for real-time predictions.
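The backoff strategy can be sketched in base R as follows. This is a minimal illustration on a toy corpus, not the final model: the function names (`tokenize`, `make_ngrams`, `best_continuation`, `predict_next`) are assumptions introduced here, ties are broken by alphabetical order, and no smoothing is applied.

```r
# Toy corpus (the sample lines from this report); a real model would use the full data.
corpus <- c("This is a sample blog text", "Blogs contain long form content",
            "Breaking news is important", "News articles are informative",
            "I love coding", "Twitter has short messages")

# Split one line into lowercase tokens.
tokenize <- function(line) {
  tokens <- strsplit(tolower(trimws(line)), "\\s+")[[1]]
  tokens[tokens != ""]
}

# Build all n-grams of size n, line by line, as space-joined strings.
make_ngrams <- function(lines, n) {
  unlist(lapply(lines, function(line) {
    tokens <- tokenize(line)
    if (length(tokens) < n) return(character(0))
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts,
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
}

bigram_counts  <- table(make_ngrams(corpus, 2))
trigram_counts <- table(make_ngrams(corpus, 3))

# Most frequent final word among the n-grams that start with `prefix`.
best_continuation <- function(counts, prefix) {
  hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  best <- names(hits)[which.max(hits)]
  sub(".* ", "", best)  # keep only the last word
}

# Try the trigram table first, then fall back to the bigram table.
predict_next <- function(phrase) {
  tokens <- tokenize(phrase)
  n <- length(tokens)
  if (n >= 2) {
    guess <- best_continuation(trigram_counts,
                               paste(tokens[(n - 1):n], collapse = " "))
    if (!is.na(guess)) return(guess)
  }
  if (n >= 1) return(best_continuation(bigram_counts, tokens[n]))
  NA_character_
}

predict_next("this is")  # found in the trigram table
predict_next("he has")   # no trigram match, falls back to the bigram table
```

With this toy corpus, "this is" is resolved by the trigram "this is a", while "he has" has no trigram match and is resolved by the bigram "has short". A phrase whose last word was never seen returns `NA`.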

9. Plan for Shiny App

The Shiny application will:

  • Take user input (a phrase)

  • Predict the next word

  • Display the result instantly

This will provide an interactive interface for users to test the model.
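A minimal sketch of such an app is shown below, assuming the `shiny` package is installed. The function `predict_next_word` is a hypothetical placeholder that returns a fixed dummy suggestion; in the real application it would be replaced by the n-gram predictor. The input and output IDs (`phrase`, `prediction`) are also assumptions chosen for illustration.

```r
library(shiny)

# Hypothetical placeholder for the model; a real app would call the
# n-gram predictor here instead of returning a fixed word.
predict_next_word <- function(phrase) {
  if (!nzchar(trimws(phrase))) return("")
  "the"  # dummy suggestion, for illustration only
}

# Layout: a text box for the phrase and a text area for the prediction.
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  # Re-evaluated whenever the input text changes, so the displayed
  # word updates as the user types.
  output$prediction <- renderText({
    predict_next_word(input$phrase)
  })
}

# Launch the app only in an interactive session.
if (interactive()) {
  runApp(shinyApp(ui = ui, server = server))
}
```

Because `renderText` depends on `input$phrase`, no submit button is needed for the result to refresh.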

10. Conclusion

This exploratory analysis confirms that the dataset is suitable for building a next-word prediction model. The next steps are to build and refine the N-gram prediction algorithm and to deploy it through a Shiny application.