Introduction

The goal of this project is to explore large text datasets and build the foundation for a predictive text input application similar to the autocomplete feature used in mobile keyboards and search engines. This report demonstrates that the data has been successfully downloaded and loaded into R, presents basic summary statistics, highlights interesting findings from exploratory data analysis, and outlines plans for developing a prediction algorithm and a Shiny application.

Motivation for the Project

The motivation for this project is fourfold. First, it demonstrates that the required text data has been successfully downloaded and loaded into R for analysis. Second, it provides a basic exploratory report that summarizes key characteristics of the datasets, including line counts and word counts. Third, the project reports interesting findings discovered during exploratory analysis, such as frequently occurring words and common word combinations. Finally, the report outlines a clear plan for building a predictive text algorithm using n-gram models and deploying it as a Shiny web application.

Data Description

The dataset consists of three large text files obtained from the Coursera Data Science Capstone project:

- Blogs
- News
- Twitter

These datasets represent different writing styles and vocabulary usage, making them suitable for building a generalized predictive text model.

Loading the Data

blogs <- readLines(
  "C:/Users/Raksha N H/OneDrive/Desktop/text_prediction_project/data/en_US/en_US.blogs.txt",
  encoding = "UTF-8",
  skipNul = TRUE
)

news <- readLines(
  "C:/Users/Raksha N H/OneDrive/Desktop/text_prediction_project/data/en_US/en_US.news.txt",
  encoding = "UTF-8",
  skipNul = TRUE
)

twitter <- readLines(
  "C:/Users/Raksha N H/OneDrive/Desktop/text_prediction_project/data/en_US/en_US.twitter.txt",
  encoding = "UTF-8",
  skipNul = TRUE
)

Summary Statistics

library(stringi)

summary_table <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)

summary_table
##    Source   Lines    Words
## 1   Blogs  899288 37546806
## 2    News 1010206 34761151
## 3 Twitter 2360148 30096690

Data Cleaning and Preprocessing

Before building the prediction model, the text data was cleaned and standardized. The following preprocessing steps were applied:

- Conversion of all text to lowercase
- Removal of punctuation and numbers
- Removal of extra whitespace
- Removal of common English stopwords
- Sampling to reduce memory usage and improve runtime performance

These steps help reduce noise and improve the quality of the prediction models.

Text Sampling

set.seed(123)                       # make the sample reproducible
sample_text <- sample(blogs, 5000)  # sample 5,000 blog lines for exploration
rm(blogs, news, twitter)            # release the full datasets
gc()                                # reclaim memory
##           used (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  831577 44.5    6382828 340.9  5145158 274.8
## Vcells 5775774 44.1   89962675 686.4 97229291 741.9

Corpus Creation and Cleaning

library(tm)

corpus <- VCorpus(VectorSource(sample_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

Finding: After removing stopwords, meaningful words such as one, will, time, and people appear frequently, indicating common language patterns useful for prediction.
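The unigram frequencies behind this finding were computed from the cleaned corpus. A minimal sketch of that step is shown below, assuming the corpus object from the previous chunk; the resulting freq vector is also reused later by the backoff model.

# Sketch: unigram frequency table from the cleaned corpus.
# Assumes `corpus` from the chunk above; `freq` is reused by the backoff function.
dtm_unigram <- DocumentTermMatrix(corpus)
freq <- sort(colSums(as.matrix(dtm_unigram)), decreasing = TRUE)
head(freq, 10)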

Bigram Model

library(RWeka)

# Tokenizer that produces two-word (bigram) tokens
BigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

dtm_bigram <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
freq_bigram <- sort(colSums(as.matrix(dtm_bigram)), decreasing = TRUE)
head(freq_bigram, 10)
## even though    new york   years ago   right now  first time   last week 
##          35          32          32          30          28          25 
##   feel like   just like   one thing   will take 
##          24          24          24          24

Finding: Common two-word combinations such as even though, new york, and right now appear frequently.

Trigram Model

# Tokenizer that produces three-word (trigram) tokens
TrigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 3, max = 3))
}

dtm_trigram <- DocumentTermMatrix(corpus, control = list(tokenize = TrigramTokenizer))
freq_trigram <- sort(colSums(as.matrix(dtm_trigram)), decreasing = TRUE)
head(freq_trigram, 10)
##   conway south carolina south carolina florists            g basal diet 
##                      13                      13                       9 
##            diet g basal             new york ny        couple years ago 
##                       6                       5                       4 
##        metal gear solid           new york city          new york state 
##                       4                       4                       4 
##      occupy wall street 
##                       4

Handling Unseen Word Combinations (Backoff Model)

To handle cases where word combinations are not observed in the data, a backoff strategy was implemented. The model first attempts to predict the next word using trigrams. If no trigram match is found, it falls back to the most frequent unigram.

predict_next_word <- function(phrase) {
  words <- unlist(strsplit(tolower(phrase), " "))

  # Try a trigram match on the last two words of the phrase
  if (length(words) >= 2) {
    prefix <- paste(tail(words, 2), collapse = " ")
    matches <- freq_trigram[grep(paste0("^", prefix, " "), names(freq_trigram))]
    if (length(matches) > 0) {
      return(strsplit(names(matches)[1], " ")[[1]][3])
    }
  }

  # Back off to the most frequent unigram
  return(names(freq)[1])
}

Example Prediction

predict_next_word("new york")
## [1] "ny"

Result: The model predicts “ny”, the third word of the most frequent trigram beginning with “new york” in the sample, demonstrating successful next-word prediction.

Model Size and Runtime Considerations

To ensure the model can run efficiently on limited hardware and within a Shiny application, the following strategies were applied:

- Sampling instead of full dataset usage
- Removing rare n-grams (see the sketch below)
- Cleaning unused objects from memory
- Using simple statistical models rather than computationally expensive methods

These decisions balance prediction accuracy with memory usage and runtime speed.
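As an illustration of the rare n-gram pruning mentioned above, a minimal sketch is shown below; the minimum count of 2 is an assumption chosen for illustration, not a value reported elsewhere in this report.

# Sketch: drop n-grams observed only once to shrink the lookup tables.
# The cutoff of 2 is an illustrative assumption.
min_count <- 2
freq_bigram  <- freq_bigram[freq_bigram >= min_count]
freq_trigram <- freq_trigram[freq_trigram >= min_count]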

Future Work: Shiny Application

In the next phase, this predictive text model will be deployed as a Shiny application. Users will be able to enter text into an input box and receive real-time word predictions based on the n-gram model.
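A minimal sketch of such an app is shown below, assuming the predict_next_word() function and the frequency tables defined above are available in the app's environment; the layout and widget names are illustrative, not the final design.

library(shiny)

# Minimal illustrative Shiny app wrapping predict_next_word();
# assumes freq and freq_trigram are loaded in the app environment.
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:", value = ""),
  verbatimTextOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("Type a phrase to see a prediction.")
    predict_next_word(input$phrase)
  })
}

shinyApp(ui = ui, server = server)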

Conclusion

This project demonstrates successful exploratory analysis of large text datasets and the construction of a basic predictive text model using n-grams. The use of unigrams, bigrams, trigrams, and a backoff strategy provides a strong and efficient foundation for building a real-world predictive text application.