Introduction

In many applications and on many devices, when we type text there is an option to suggest the next word, offering several candidates. Sometimes it helps, sometimes it drives you crazy …

If we imagine ourselves predicting the next word based on the previous input words, we realize that it is an enormously difficult task: “my dog is ….” (running, jumping, sleeping …?), “he is a …” (driver, husband, boy …?)

Two important things are clear from this mental simulation of word prediction: several continuations are usually plausible, and the choice depends heavily on the preceding words.

Overview

The objective of the project is to build a model that predicts the next word based on the previous input text and to make it available as a Shiny application.

Model constraints

  • It should run on mobile devices to predict the next word while the user is typing, so it should not be “heavy” in terms of memory (a few dozen MB at most) or in terms of computing (prediction time should feel immediate, i.e. under a second).
  • It should be based on the N-gram approach (more advanced generative AI models exist, but they are out of scope here).

Project steps

  1. Obtain the dataset and divide it into training, validation and testing sub-sets.

  2. Clean and analyse the training data.

  3. Build the optimal model for word prediction.

     3a. Identify the model options.
     3b. Build and “train” the models on the training data.
     3c. Evaluate the models on the cleaned validation data and select the best one.
  4. Evaluate the selected model on the cleaned testing data.

  5. Deploy it on a Shiny server for users to try.

Questions to consider

  • What part of the provided data set should we use?
  • How should we clean the data?
  • How can we optimize the N-gram model in terms of volume?
  • What are the options for prediction models?
  • How many input words are required to make a prediction?
  • What accuracy does the model achieve?

Data

Loading data

The data was loaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The data includes three files: blogs, news and twitter.
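A minimal sketch of how the data could be loaded in base R (the extracted folder structure and local file paths are assumptions):

```r
# Download and unpack the dataset, then read the three English files line by line
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}

blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```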

Data analysis

file                   lines      words   size_MB
en_US.blogs.txt       899288   37546250       268
en_US.news.txt         77259    2674536        21
en_US.twitter.txt    2360148   30093372       334
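The summary above could be produced along these lines; word counts here are simple white-space splits, so exact numbers may differ slightly:

```r
# Summary statistics per file: line count, word count, size on disk
file_stats <- function(path, lines) {
  data.frame(
    file    = basename(path),
    lines   = length(lines),
    words   = sum(lengths(strsplit(lines, "\\s+"))),
    size_MB = round(file.size(path) / 1024^2)
  )
}

stats <- rbind(
  file_stats("final/en_US/en_US.blogs.txt",   blogs),
  file_stats("final/en_US/en_US.news.txt",    news),
  file_stats("final/en_US/en_US.twitter.txt", twitter)
)
stats
```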

As we can see, the provided data contains a very large number of lines and words and takes up hundreds of MB.

That means we will need to use only a small part of the data to construct the model. The good side is that we have plenty of data left for validation and testing purposes.

Moving forward with the plan above, we will need to make important decisions on the scope of data to use, on splitting the data into training, validation and testing sub-sets, and on data cleaning.

Create data subsets

The initial data sets are divided into three non-overlapping parts: 60% training, 20% validation, 20% testing.
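A minimal sketch of the 60/20/20 split, assuming the lines read earlier (the seed and helper name are illustrative):

```r
set.seed(1234)  # illustrative seed for reproducibility
split_data <- function(lines) {
  n   <- length(lines)
  idx <- sample(seq_len(n))                        # random permutation of line indices
  cut <- c(0, floor(0.6 * n), floor(0.8 * n), n)   # 60% / 20% / 20% boundaries
  list(
    training   = lines[idx[(cut[1] + 1):cut[2]]],
    validation = lines[idx[(cut[2] + 1):cut[3]]],
    testing    = lines[idx[(cut[3] + 1):cut[4]]]
  )
}

blogs_split   <- split_data(blogs)
news_split    <- split_data(news)
twitter_split <- split_data(twitter)
```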

Number of lines in data sets

source        training   validation   testing
blogs           539572       179857    179857
news             46355        15451     15451
twitter        1416088       472029    472029

These parts are still too large to use directly. We take small random samples (3-5%) from all sub-sets and use them in the project.
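A sketch of the random sampling step (the 4% fraction matches the table below; the helper name is illustrative):

```r
# Take a small random fraction of each sub-set for model development
take_fraction <- function(lines, fraction = 0.04) {
  sample(lines, size = floor(fraction * length(lines)))
}

training <- c(take_fraction(blogs_split$training),
              take_fraction(news_split$training),
              take_fraction(twitter_split$training))
# The validation and testing samples are built the same way
```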

Smaller data sub-sets (4%) for model development and testing

metric     training   validation   testing
lines         80079        26693     26693
words       1683612       556813    572087

Data cleaning

We clean the training text by 1) converting it to lowercase, 2) removing numbers, 3) removing special symbols, 4) removing punctuation, and 5) removing extra white space.

We do not remove stop words or “bad” words, assuming that keeping them does not significantly affect the statistics.
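A sketch of the cleaning steps above in base R (the output file name is an assumption):

```r
# Clean a character vector of lines following the five steps above
clean_text <- function(lines) {
  lines <- tolower(lines)               # 1) lowercase
  lines <- gsub("[0-9]+", " ", lines)   # 2) remove numbers
  lines <- gsub("[^a-z ]", " ", lines)  # 3) + 4) remove special symbols and punctuation
  lines <- gsub("\\s+", " ", lines)     # 5) collapse extra white space
  trimws(lines)
}

training_clean <- clean_text(training)
writeLines(training_clean, "training_clean.txt")  # assumed output file name
```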

Cleaned corpus training text file: 80079 lines, 381 MB

Ngram creation and analysis

Based on the cleaned corpus file, we create 1-, 2-, 3- and 4-grams and analyse them in order to decide on the prediction model.
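The N-grams can be counted with a small base-R sketch like the one below (real implementations typically rely on packages such as quanteda or data.table for speed; the helper name is illustrative):

```r
# Count n-grams: split each line into words and paste sliding windows of n words
count_ngrams <- function(lines, n) {
  words <- strsplit(lines, " ", fixed = TRUE)
  grams <- unlist(lapply(words, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)  # term -> count, most frequent first
}

ngram1 <- count_ngrams(training_clean, 1)
ngram2 <- count_ngrams(training_clean, 2)
ngram3 <- count_ngrams(training_clean, 3)
ngram4 <- count_ngrams(training_clean, 4)
```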

metric           Ngram1    Ngram2    Ngram3    Ngram4
total terms       64421    626333   1194536   1367659
frequency >1      31659    138548    106725     36567
size MB               5        51       104       129

Ngram adjustment

The combined size of the N-grams (more than 250 MB) is too large for our application. How can we optimize the N-grams?

We have about 64,000 unigrams. This is a reasonable set in terms of language coverage, as the average human vocabulary is in the range of 20,000-30,000 words.

Words are distributed very unequally in terms of usage frequency: 50% of all word counts come from the top 250 words only, 90% from the top 8,000 words, and 95% from the top 20,000 words.
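These coverage figures can be checked by sorting the unigram counts and computing their cumulative share, for example:

```r
# Share of all word occurrences covered by the top-k most frequent words
counts   <- as.numeric(sort(ngram1, decreasing = TRUE))
coverage <- cumsum(counts) / sum(counts)

min(which(coverage >= 0.50))  # ~250 words in our data
min(which(coverage >= 0.90))  # ~8,000 words
min(which(coverage >= 0.95))  # ~20,000 words
```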

One optimization approach is to introduce a dictionary and to limit all N-grams to words from that dictionary only. I have done this exercise with a 30,000-word dictionary. This approach reduces the size of the 2-, 3- and 4-grams by a few dozen percent, which is not enough.

Another approach is to exclude 2-, 3- and 4-grams with a count equal to 1, as they are rare. This reduces the total N-gram volume to a bit over 20 MB, which looks reasonable.

It has been decided to use the latter approach.
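A sketch of the chosen pruning step, dropping 2-, 3- and 4-grams that occur only once:

```r
# Keep only n-grams that occur more than once (unigrams are kept in full)
prune_singletons <- function(ngram_table) {
  ngram_table[ngram_table > 1]
}

ngram2_adj <- prune_singletons(ngram2)
ngram3_adj <- prune_singletons(ngram3)
ngram4_adj <- prune_singletons(ngram4)
```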

Adjusted Ngrams

metric      Ngram1    Ngram2    Ngram3    Ngram4
terms        64421    138548    106725     36567
size MB          5        11         9         3
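With the adjusted N-grams in place, one simple way to use them for prediction is a back-off lookup: try the 4-gram table with the last three input words, fall back to the 3-gram and then the 2-gram table, and finally return the most frequent unigram. The sketch below only illustrates the idea and is not necessarily the model selected in the project steps above; the function name is illustrative:

```r
# Back-off lookup: use the longest context available, fall back to shorter n-grams
predict_next_word <- function(input) {
  words  <- strsplit(clean_text(input), " ", fixed = TRUE)[[1]]
  tables <- list(`4` = ngram4_adj, `3` = ngram3_adj, `2` = ngram2_adj)
  for (n in 4:2) {
    if (length(words) < n - 1) next
    context <- paste(tail(words, n - 1), collapse = " ")
    tab     <- tables[[as.character(n)]]
    hits    <- tab[startsWith(names(tab), paste0(context, " "))]
    if (length(hits) > 0) {
      best <- names(sort(hits, decreasing = TRUE))[1]          # most frequent continuation
      return(tail(strsplit(best, " ", fixed = TRUE)[[1]], 1))  # return its last word
    }
  }
  names(sort(ngram1, decreasing = TRUE))[1]  # no match: most frequent word overall
}

predict_next_word("my dog is")
```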