Milestone Report: Next Word Prediction

Author

Chaithra Rai

1. Introduction

The goal of this project is to build a predictive text model that can suggest the next word based on a given phrase. This report summarizes the exploratory analysis of the dataset and outlines the approach for building the prediction algorithm and Shiny application.

2. Data Overview

The analysis uses text from three sources:

Blogs
News
Twitter

These datasets contain natural language text and will be used to train a next-word prediction model.

3. Data Loading


# Load required library
library(stringr)

# Sample data (replace with actual dataset if available)
blogs <- c("This is a sample blog text", "Blogs contain long form content")
news <- c("Breaking news is important", "News articles are informative")
twitter <- c("I love coding", "Twitter has short messages")
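The sample vectors above stand in for the real corpus. The sketch below shows one way the actual files could be read. The helper name `read_corpus` and the temporary demo file are assumptions for illustration only; the report does not name the real file paths, so none are filled in here. It uses base R.

```r
# Hypothetical helper: read one text file into a character vector, one element per line.
# Opening the connection in binary mode and skipping NUL bytes guards against
# stray control characters that can otherwise truncate the read.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Demo with a temporary file so the example runs without the real dataset.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("I love coding", "Twitter has short messages"), demo_path)

twitter_demo <- read_corpus(demo_path)
length(twitter_demo)
```

With the real data, each of `blogs`, `news`, and `twitter` would be assigned from one call to the helper with the corresponding file path.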

4. Basic Statistics

line_counts <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter))
)

line_counts

5. Word Counts

# Count words as runs of non-whitespace characters, summed over all lines
word_count <- function(text) {
  sum(str_count(text, "\\S+"))
}

word_counts <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Words = c(word_count(blogs), word_count(news), word_count(twitter))
)

word_counts

6. Data Exploration

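# Combine all sources, lowercase, and strip punctuation so it does not inflate word lengths
all_text <- c(blogs, news, twitter)
clean_text <- str_replace_all(tolower(all_text), "[[:punct:]]", "")
words <- unlist(str_split(str_trim(clean_text), "\\s+"))
words <- words[words != ""]  # drop empty tokens produced by blank lines
word_lengths <- nchar(words)

# Histogram of word lengths (in characters)
hist(word_lengths,
     main = "Word Length Distribution",
     xlab = "Word Length",
     col = "lightblue",
     border = "white")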

7. Key Findings

Twitter text contains shorter sentences than Blogs and News
Blogs contain longer, more descriptive text
Most words are 3–7 characters long
The dataset includes both formal and informal language
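The length comparison in the findings above can be checked with a per-source average. The sketch below is a minimal illustration that reuses the report's sample sentences rather than the real dataset. The helper name `words_per_line` is an assumption, and base R is used in place of stringr to keep the example self-contained.

```r
# Sample data copied from the Data Loading section (not the real corpus)
blogs <- c("This is a sample blog text", "Blogs contain long form content")
news <- c("Breaking news is important", "News articles are informative")
twitter <- c("I love coding", "Twitter has short messages")

# Hypothetical helper: number of whitespace-separated words in each line
words_per_line <- function(text) {
  lengths(strsplit(trimws(text), "\\s+"))
}

avg_words <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  MeanWordsPerLine = c(mean(words_per_line(blogs)),
                       mean(words_per_line(news)),
                       mean(words_per_line(twitter)))
)

avg_words
```

On the sample data the averages are 5.5, 4, and 3.5 words per line. The same calculation on the full dataset would be needed to support the finding properly.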

8. Plan for Prediction Algorithm

The prediction model will use an N-gram approach:

Bigram model (uses 1 previous word)
Trigram model (uses 2 previous words)
Backoff strategy:
  Try the trigram model first
  If no match is found, fall back to the bigram model

Because each prediction is a lookup in a precomputed frequency table, this approach is efficient and suitable for real-time predictions.
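The backoff plan above could be sketched as follows. This is a minimal illustration under stated assumptions, not the final model: the function names (`tokenize`, `build_table`, `predict_next`) are hypothetical, the corpus is the report's sample sentences, and the most frequent continuation is chosen with ties broken by table order. It uses base R only.

```r
# Toy corpus: the report's sample sentences (not the real dataset)
corpus <- c("This is a sample blog text", "Blogs contain long form content",
            "Breaking news is important", "News articles are informative",
            "I love coding", "Twitter has short messages")

# Lowercase, replace anything except letters, apostrophes, and spaces, then split
tokenize <- function(line) {
  clean <- gsub("[^a-z' ]", " ", tolower(line))
  tokens <- strsplit(trimws(clean), "\\s+")[[1]]
  tokens[nzchar(tokens)]
}

# Count n-grams as (prefix of n-1 words, next word) pairs.
# Growing vectors in a loop is slow on large data; fine for a sketch.
build_table <- function(lines, n) {
  prefix <- character(0)
  nxt <- character(0)
  for (line in lines) {
    tok <- tokenize(line)
    if (length(tok) < n) next
    for (i in seq_len(length(tok) - n + 1)) {
      prefix <- c(prefix, paste(tok[i:(i + n - 2)], collapse = " "))
      nxt <- c(nxt, tok[i + n - 1])
    }
  }
  if (length(prefix) == 0) {
    return(data.frame(prefix = character(0), nxt = character(0), n = integer(0)))
  }
  pairs <- data.frame(prefix = prefix, nxt = nxt, n = rep(1L, length(prefix)),
                      stringsAsFactors = FALSE)
  aggregate(n ~ prefix + nxt, data = pairs, FUN = sum)
}

# Most frequent next word for a prefix, or NA if the prefix was never seen
lookup <- function(tbl, key) {
  hits <- tbl[tbl$prefix == key, , drop = FALSE]
  if (nrow(hits) == 0) return(NA_character_)
  as.character(hits$nxt[which.max(hits$n)])
}

# Try the trigram table first, then fall back to the bigram table
predict_next <- function(phrase, bigrams, trigrams) {
  tok <- tokenize(phrase)
  if (length(tok) >= 2) {
    guess <- lookup(trigrams, paste(tail(tok, 2), collapse = " "))
    if (!is.na(guess)) return(guess)
  }
  if (length(tok) >= 1) return(lookup(bigrams, tail(tok, 1)))
  NA_character_
}

bigrams <- build_table(corpus, 2)
trigrams <- build_table(corpus, 3)

predict_next("Breaking news", bigrams, trigrams)  # trigram match: "is"
predict_next("hello love", bigrams, trigrams)     # bigram fallback: "coding"
```

When neither table contains the context, the sketch returns `NA`. How the final model should handle that case is left open here, since the plan above only specifies the trigram and bigram steps.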

9. Plan for Shiny App

The Shiny application will:

Take user input (a phrase)
Predict the next word
Display the result instantly

This will provide an interactive interface for users to test the model.
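The three steps listed above map onto a small Shiny app roughly as sketched below. This assumes the shiny package is installed. The function `predict_next_word` is a hypothetical stand-in backed by a two-entry toy lookup, not the planned n-gram model, and the input and output IDs (`phrase`, `prediction`) are illustrative names.

```r
library(shiny)

# Hypothetical stand-in for the real model: looks up the last word in a toy table
predict_next_word <- function(phrase) {
  toy_model <- c(love = "coding", news = "is")
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  guess <- toy_model[tail(words, 1)]
  if (length(guess) == 0 || is.na(guess)) "(no prediction)" else unname(guess)
}

# UI: one text box for the phrase, one text output for the predicted word
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

# Server: recompute the prediction whenever the input changes
server <- function(input, output, session) {
  output$prediction <- renderText({
    req(input$phrase)  # show nothing until the user has typed something
    predict_next_word(input$phrase)
  })
}

# Uncomment to launch the app in an interactive session:
# shinyApp(ui, server)
```

Because `renderText` is reactive, the displayed word updates as the user types, which is what makes the result appear instantly.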

10. Conclusion

This exploratory analysis indicates that the dataset is suitable for building a next-word prediction model. The next steps are to build the N-gram prediction algorithm with backoff and deploy it through a Shiny application.