2025-10-24

Reading Data

I had three US English text files: blogs, Twitter and news. First I read these files and computed simple statistics on them, as follows.

[figure: summary statistics for the three files]

Then I split each of the three files into quantiles and kept only the observations between the 25th and 75th percentiles, which cut the data size roughly in half. I then appended all the data (blogs, Twitter and news) into one file.
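The filtering step can be sketched as follows. This is a minimal Python illustration, not the code behind these slides. The slides do not say which quantity the quantiles were taken over, so the sketch assumes line length in characters. The names `keep_middle_half` and `build_corpus` are hypothetical.

```python
import statistics

def keep_middle_half(lines):
    """Keep lines whose length lies between the 25th and 75th percentiles.

    Assumption: the quantiles are computed over line length in characters.
    """
    lengths = [len(line) for line in lines]
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    return [line for line in lines if q1 <= len(line) <= q3]

def build_corpus(sources):
    """Filter each source separately, then append everything into one list."""
    corpus = []
    for lines in sources:
        corpus.extend(keep_middle_half(lines))
    return corpus

# Tiny stand-ins for the real files.
blogs = ["a", "bb", "ccc", "dddd", "eeeee"]
twitter = ["x", "yy", "zzz"]
corpus = build_corpus([blogs, twitter])
```

Because lines of equal length all fall on the same side of a cut point, the retained share is only approximately half.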

Cleaning Data

Then I cleaned the resulting data:

  • removed URLs, email addresses, Twitter handles and hashtags
  • removed ordinal numbers
  • removed bad words listed in a downloaded file
  • removed punctuation
  • trimmed leading and trailing whitespace
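The cleaning steps above could look roughly like this in Python. It is a sketch under assumptions: the regular expressions are illustrative guesses at what each step matches, `clean_text` is a hypothetical name, and `bad_words` stands in for the contents of the downloaded file.

```python
import re
import string

# Illustrative patterns; the exact rules used for the slides are not known.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
HANDLE_OR_HASHTAG_RE = re.compile(r"[@#]\w+")
ORDINAL_RE = re.compile(r"\b\d+(?:st|nd|rd|th)\b", re.IGNORECASE)
PUNCT_RE = re.compile(r"[^\w\s]|_")

def clean_text(text, bad_words=frozenset()):
    """Apply the cleaning steps in the order listed on the slide."""
    text = URL_RE.sub(" ", text)
    # Emails go before handles, otherwise "@example" would be cut out of them.
    text = EMAIL_RE.sub(" ", text)
    text = HANDLE_OR_HASHTAG_RE.sub(" ", text)
    text = ORDINAL_RE.sub(" ", text)
    # Drop bad words, ignoring case and any punctuation stuck to the token.
    kept = [w for w in text.split()
            if w.strip(string.punctuation).lower() not in bad_words]
    text = PUNCT_RE.sub(" ", " ".join(kept))
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())

sample = "  Visit http://example.com or mail me@example.com, @bob! #fun 1st place, darn it.  "
cleaned = clean_text(sample, bad_words={"darn"})
```

Punctuation is replaced by a space rather than deleted, so contractions such as "don't" split into two tokens. Whether that is desirable depends on how the n-grams are built.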

Model / Creating Ngrams -1

Using the cleaned data, I created four files:

  • First-word prediction file with the most frequent words. It is used to predict the first word when there is no input from the user yet
  • BIGRAM file containing all n-grams of order 2 (two words)
  • TRIGRAM file containing all n-grams of order 3 (three words)
  • QUADGRAM file containing all n-grams of order 4 (four words)

The n-gram files contain the following columns:

  • full text - the entered text followed by the predicted word
  • n - number of occurrences
  • percent - percentage of occurrences
  • predicted - predicted word for the entered text
  • text - entered text
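A table with these columns can be built along the following lines. This Python sketch is illustrative only. It assumes whitespace tokenization, and it assumes `percent` is the share of an n-gram among all n-grams of the same order, which the slides do not specify. `ngram_table` is a hypothetical name.

```python
from collections import Counter

def ngram_table(lines, order):
    """Count n-grams of the given order and return one row per n-gram."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for i in range(len(words) - order + 1):
            counts[tuple(words[i:i + order])] += 1
    total = sum(counts.values())
    rows = []
    for gram, n in counts.most_common():  # most frequent first
        rows.append({
            "full_text": " ".join(gram),
            "n": n,
            # Assumption: percentage relative to all n-grams of this order.
            "percent": 100.0 * n / total,
            "predicted": gram[-1],         # last word is the prediction
            "text": " ".join(gram[:-1]),   # preceding words are the input
        })
    return rows

bigrams = ngram_table(["i love you", "i love cake"], 2)
```

The same function covers the trigram and quadgram files by passing `order=3` or `order=4`.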

Model / Creating Ngrams -2

NGRAMS:

[two figures showing the n-grams]

Model / Creating Ngrams -3

NGRAMS:

[two figures showing the n-grams]

Creating shiny app

The Shiny app uses those n-gram files to look up the predicted word (predicted column) from the user's input text (text column). Note that the user input is cleaned in the same way as described in slide 3. The app looks like:

[screenshot of the Shiny app]
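The lookup itself might work as in the Python sketch below. The slides only say that the predicted word is looked up from the text column. Trying the longest context first and falling back to shorter ones, and finally to the first-word file, is an assumption made here for illustration. `build_lookup` and `predict` are hypothetical names, and the input is assumed to be cleaned already.

```python
def build_lookup(rows):
    """Map each 'text' value to its most frequent 'predicted' word."""
    best = {}
    for row in rows:
        current = best.get(row["text"])
        if current is None or row["n"] > current["n"]:
            best[row["text"]] = row
    return {text: row["predicted"] for text, row in best.items()}

def predict(user_text, lookups, first_word):
    """Return a predicted next word.

    lookups maps an n-gram order (e.g. 4, 3, 2) to a dict from build_lookup.
    Assumption: longest context first, then back off to shorter contexts.
    """
    words = user_text.split()
    if not words:
        return first_word  # no input yet: use the first-word file
    for order in sorted(lookups, reverse=True):
        context = words[-(order - 1):]
        if len(context) < order - 1:
            continue  # not enough words typed for this order
        key = " ".join(context)
        if key in lookups[order]:
            return lookups[order][key]
    return first_word

# Made-up rows with the columns the lookup needs.
bigram_rows = [
    {"text": "love", "predicted": "you", "n": 3},
    {"text": "love", "predicted": "cake", "n": 1},
]
trigram_rows = [{"text": "i love", "predicted": "cake", "n": 2}]
lookups = {2: build_lookup(bigram_rows), 3: build_lookup(trigram_rows)}
```

With these rows, `predict("i love", lookups, "the")` uses the trigram table, while `predict("we love", lookups, "the")` falls back to the bigram table.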