Introduction

The goal of this project is to build a predictive text model that suggests the next word from the text a user has typed so far. The analysis uses the English (en_US) blogs, news articles, and Twitter posts in the HC Corpora dataset.

Loading the Data

blogs <- readLines("final/en_US/en_US.blogs.txt",
                   encoding = "UTF-8",
                   skipNul = TRUE)

news <- readLines("final/en_US/en_US.news.txt",
                  encoding = "UTF-8",
                  skipNul = TRUE)

twitter <- readLines("final/en_US/en_US.twitter.txt",
                     encoding = "UTF-8",
                     skipNul = TRUE)

Basic Statistics

stats <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs),
            length(news),
            length(twitter))
)

stats

##   Dataset   Lines
## 1   Blogs  899288
## 2    News 1010206
## 3 Twitter 2360148

Sampling the Data

# Fix the random seed so the sample is reproducible
set.seed(123)

# Draw 1,000 lines at random from each source
sampleData <- c(
  sample(blogs, 1000),
  sample(news, 1000),
  sample(twitter, 1000)
)

length(sampleData)

## [1] 3000

Text Cleaning

library(tm)

## Loading required package: NLP

corpus <- Corpus(VectorSource(sampleData))

corpus <- tm_map(corpus, content_transformer(tolower))

## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents

corpus <- tm_map(corpus, removePunctuation)

## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents

corpus <- tm_map(corpus, removeNumbers)

## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents

corpus <- tm_map(corpus, stripWhitespace)

## Warning in tm_map.SimpleCorpus(corpus, stripWhitespace): transformation drops
## documents

Word Frequency Analysis

tdm <- TermDocumentMatrix(corpus)

m <- as.matrix(tdm)

freq <- sort(rowSums(m),
             decreasing = TRUE)

head(freq, 20)

##   the   and  that   for   you  with   was  have  this   are   but   not  from 
##  4319  2205   990   953   723   599   599   465   448   417   412   376   354 
##  said  will   his   one  they about   all 
##   291   287   279   268   264   264   256

Top 20 Words

barplot(freq[1:20],
        las = 2,
        main = "Top 20 Most Frequent Words")

Findings

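With 2,360,148 lines, the Twitter file contains the most text entries, followed by news (1,010,206) and blogs (899,288). Blog entries tend to be longer than tweets, although entry length is not measured in this report. A small number of words account for a large share of total word usage: in the 3,000-line sample, "the" (4,319) and "and" (2,205) occur far more often than any other term. Stop words were not removed, and because TermDocumentMatrix by default keeps only terms of at least three characters, short words such as "to" and "of" do not appear in the list. This skewed distribution is typical of natural language text.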
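The two observations above (entry length and the concentration of usage in a few words) could be checked with a short sketch like the following. The helper names mean_words, coverage, and words_needed are hypothetical, and toy_freq is made-up illustration data, not output from the corpus.

```r
# Mean number of whitespace-separated tokens per entry.
# (Hypothetical helper; a rough proxy for document length.)
mean_words <- function(lines) {
  mean(lengths(strsplit(trimws(lines), "\\s+")))
}

# Cumulative share of all word occurrences covered by the
# top-k words, for k = 1, 2, ...
coverage <- function(freq) {
  freq <- sort(freq, decreasing = TRUE)
  cumsum(freq) / sum(freq)
}

# Smallest number of distinct words needed to reach a target share.
words_needed <- function(freq, target = 0.5) {
  unname(which(coverage(freq) >= target)[1])
}

# Toy frequency vector (illustration only, not real counts)
toy_freq <- c(the = 50, and = 25, that = 15, cat = 5, dog = 5)
words_needed(toy_freq, 0.5)
words_needed(toy_freq, 0.8)
```

In the report itself, the same helpers could be applied to the real objects, for example sapply(list(Blogs = blogs, News = news, Twitter = twitter), mean_words) and words_needed(freq, 0.5), assuming those objects are still in memory.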

Prediction Algorithm Plan

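The final prediction algorithm will use an n-gram language model with a backoff strategy. The model will first look for trigrams that match the last two words typed; if none are found, it will back off to bigrams matching the last word, and finally to the most frequent unigrams. Looking up precomputed n-gram tables keeps each prediction fast, while the longer contexts are expected to give more accurate suggestions whenever they are available.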
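The backoff order described above can be sketched as follows. This is a minimal illustration, not the final model: the trigram and bigram tables hold made-up counts, the function names lookup and predict_backoff are hypothetical, and no smoothing or discounting is applied. The unigram counts are taken from the top of the frequency list in this report.

```r
# Toy n-gram tables. Trigram and bigram counts are invented
# for illustration only.
trigrams <- data.frame(
  context = c("thanks for", "thanks for", "one of"),
  word    = c("the", "your", "the"),
  count   = c(12, 7, 20),
  stringsAsFactors = FALSE
)
bigrams <- data.frame(
  context = c("for", "for", "of"),
  word    = c("the", "a", "the"),
  count   = c(50, 30, 90),
  stringsAsFactors = FALSE
)
unigrams <- data.frame(
  word  = c("the", "and", "that"),
  count = c(4319, 2205, 990),
  stringsAsFactors = FALSE
)

# Return up to k words that follow `context`, most frequent first.
lookup <- function(table, context, k) {
  hits <- table[table$context == context, ]
  hits <- hits[order(-hits$count), ]
  head(hits$word, k)
}

# Simple backoff: trigram, then bigram, then unigram.
predict_backoff <- function(text, k = 3) {
  # Mirror the cleaning step: lower-case, drop punctuation and digits
  clean <- gsub("[[:punct:][:digit:]]+", "", tolower(text))
  words <- strsplit(trimws(clean), "\\s+")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)

  if (n >= 2) {
    hits <- lookup(trigrams, paste(words[n - 1], words[n]), k)
    if (length(hits) > 0) return(hits)
  }
  if (n >= 1) {
    hits <- lookup(bigrams, words[n], k)
    if (length(hits) > 0) return(hits)
  }
  # No context matched: fall back to the most frequent single words
  head(unigrams$word[order(-unigrams$count)], k)
}

predict_backoff("Thanks for")
predict_backoff("waiting for")
predict_backoff("hello")
```

With these toy tables, the first call is answered from the trigram table, the second from the bigram table, and the third from the unigram table. A production model would likely replace the data frames with a keyed structure for faster lookup.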

Shiny Application Plan

The Shiny application will allow users to enter text and receive next-word predictions in real time. The application will display the most likely predicted word along with alternative suggestions.
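A minimal sketch of such an application is shown below, assuming the shiny package is installed. The predict_next function is a hypothetical placeholder that returns fixed suggestions; in the real application it would call the n-gram model. The input and output IDs (phrase, best, alternatives) are likewise illustrative.

```r
library(shiny)

# Hypothetical placeholder for the real prediction model:
# returns fixed suggestions for any non-empty input.
predict_next <- function(text, n = 3) {
  if (!nzchar(trimws(text))) return(character(0))
  head(c("the", "and", "that"), n)
}

ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Enter text:"),
  strong("Most likely next word:"),
  textOutput("best"),
  strong("Alternatives:"),
  textOutput("alternatives")
)

server <- function(input, output, session) {
  # Recomputed automatically whenever the text input changes
  suggestions <- reactive({
    predict_next(input$phrase)
  })

  output$best <- renderText({
    s <- suggestions()
    if (length(s) > 0) s[1] else ""
  })

  output$alternatives <- renderText({
    paste(suggestions()[-1], collapse = ", ")
  })
}

app <- shinyApp(ui, server)

# Launch only when run interactively
if (interactive()) runApp(app)
```

Because the outputs depend on a reactive expression, the suggestions update as the user types, without a submit button.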

Conclusion

This exploratory analysis successfully loaded and examined the HC Corpora dataset. The data has been sampled, cleaned, and analyzed to identify common word patterns. Future work will focus on developing an efficient next-word prediction model and deploying it through a Shiny application.

Data Science Capstone Milestone Report

Ritika Phulwani

2026-06-13