Word Prediction

2025-10-24

Reading Data

I had three US English files, corresponding to blogs, Twitter and news. First I read these files and computed simple statistics on them as follows.

Then I split each of the three files into quantiles and kept only the observations between the 25th and 75th percentiles, which cuts the data size roughly in half. Finally I appended all the data (blogs, Twitter and news) into one file.
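The filtering step can be sketched as follows. This is a minimal Python illustration, not the code used for the slides. The text does not say which quantity the quantiles are computed on, so the sketch assumes it is line length in characters. The function name keep_middle_half and the three small sample lists are hypothetical.

```python
import statistics

def keep_middle_half(lines):
    """Keep lines whose length lies between the 25th and 75th percentiles.

    Assumption: the quantiles are taken over line length in characters.
    """
    lengths = [len(line) for line in lines]
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    return [line for line in lines if q1 <= len(line) <= q3]

# Hypothetical stand-ins for the blogs, Twitter and news files.
blogs = ["a", "bb", "ccc", "dddd", "eeeee"]
twitter = ["x", "yy", "zzz", "wwww", "vvvvv"]
news = ["n", "nn", "nnn", "nnnn", "nnnnn"]

# Filter each source separately, then append everything into one collection.
combined = []
for source in (blogs, twitter, news):
    combined.extend(keep_middle_half(source))
```

Because the bounds are inclusive, ties at the quartile values can keep slightly more than half of the lines.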

Cleaning Data

Then I cleaned the resulting data:

removed URLs, email addresses, Twitter handles and hashtags
removed ordinal numbers
removed bad words according to a downloaded word list
removed punctuation
trimmed leading and trailing whitespace
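The cleaning steps above could look roughly like this in Python. It is a sketch under assumptions: the regular expressions are simplified approximations, BAD_WORDS is a hypothetical stand-in for the downloaded file, and the order of the steps was chosen for the sketch (punctuation is stripped before the bad-word check so that "darn." still matches).

```python
import re

# Hypothetical stand-in for the downloaded bad-word list.
BAD_WORDS = {"darn", "heck"}

# Simplified patterns; real-world URLs and emails are messier.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
HANDLE_OR_TAG_RE = re.compile(r"[@#]\w+")
ORDINAL_RE = re.compile(r"\b\d+(?:st|nd|rd|th)\b", re.IGNORECASE)
PUNCT_RE = re.compile(r"[^\w\s]|_")

def clean(text):
    """Apply the cleaning steps to one line of text."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)         # must run before handle removal
    text = HANDLE_OR_TAG_RE.sub(" ", text)
    text = ORDINAL_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    # split() discards leading, trailing and repeated whitespace.
    words = [w for w in text.split() if w.lower() not in BAD_WORDS]
    return " ".join(words)

sample = ("Visit https://example.com now! Email me@site.org, "
          "follow @user #tag on the 3rd day, darn it.  ")
cleaned = clean(sample)
```

Emails are removed before handles so that the "@site" part of an address is not treated as a Twitter handle.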

Model / Creating Ngrams -1

Using the cleaned data, I created four files:

First-word prediction file with the most frequent words. It is used to predict the first word when there is no input from the user yet
BIGRAM file containing all n-grams of order 2 (two words)
TRIGRAM file containing all n-grams of order 3 (three words)
QUADGRAM file containing all n-grams of order 4 (four words)
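As an illustration of how the four files could be derived, here is a minimal Python sketch that counts words and n-grams with collections.Counter. The helper names ngrams and count_ngrams and the three sample sentences are hypothetical, and n-grams are assumed not to cross line boundaries.

```python
from collections import Counter

def ngrams(tokens, order):
    """Return all runs of `order` consecutive tokens as tuples."""
    return [tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)]

def count_ngrams(sentences, order):
    """Count n-grams of the given order, one sentence at a time."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), order))
    return counts

# Hypothetical cleaned input lines.
sentences = ["i love new york", "i love new shoes", "i love you"]

# Most frequent single words, for the case where the user has typed nothing.
word_counts = Counter(word for s in sentences for word in s.split())
first_words = word_counts.most_common(3)

bigrams = count_ngrams(sentences, 2)
trigrams = count_ngrams(sentences, 3)
quadgrams = count_ngrams(sentences, 4)
```

A line shorter than the requested order simply contributes no n-grams, so "i love you" adds nothing to the quadgram counts.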

The n-gram files contain the following columns:

full text - the entered text followed by the predicted word
n - number of occurrences
percent - percentage of occurrences
predicted - predicted word for entered text
text - entered text
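To make the column layout concrete, the following Python sketch turns n-gram counts into rows with these five columns and uses them for a lookup. It rests on assumptions: percent is taken to be the n-gram's share of all occurrences in the same file, which the text does not specify, and the names build_rows and predict as well as the sample counts are hypothetical.

```python
from collections import Counter

def build_rows(counts):
    """Turn n-gram counts into rows with the five columns listed above."""
    total = sum(counts.values())
    rows = []
    for gram, n in counts.items():
        rows.append({
            "full_text": " ".join(gram),      # entered text plus predicted word
            "n": n,                           # number of occurrences
            "percent": 100.0 * n / total,     # assumed: share within this file
            "predicted": gram[-1],            # last word of the n-gram
            "text": " ".join(gram[:-1]),      # all words before the last one
        })
    return rows

def predict(rows, text):
    """Return the most frequent predicted word for `text`, or None."""
    candidates = [row for row in rows if row["text"] == text]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row["n"])["predicted"]

# Hypothetical trigram counts.
trigram_counts = Counter({
    ("i", "love", "new"): 2,
    ("i", "love", "you"): 1,
    ("love", "new", "york"): 1,
})
rows = build_rows(trigram_counts)
suggestion = predict(rows, "i love")
```

Splitting each n-gram into text and predicted ahead of time means a prediction is a plain equality match on the text column, followed by picking the row with the highest n.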

Model / Creating Ngrams -2

NGRAMS:

Model / Creating Ngrams -3

NGRAMS:

Creating shiny app

The Shiny app uses those n-gram files to look up the predicted word (predicted column) from the user's input text (text column). Note that the user input is cleaned in the same way as explained in slide 3. The app looks like: