2023-08-26

Introduction

Text-Ease is an app which suggests 3 words that can follow the text entered by the user.

The model behind the app was built on data from 3 internet sources: Twitter, blogs, and news.

The model estimates the likelihood of each candidate next word from patterns learned in the above data set, and displays the 3 most likely words to follow the last 3 words of the text entered by the user.

Salient Features of the App

  1. The app provides suggestions and lets the user either click a suggestion to update the input text or keep typing, in which case the model keeps suggesting words.
  2. The app also takes sentence-end tokens such as ., ?, or ! into consideration and changes the suggested words accordingly.
  3. The model behind the app makes sure 3 predictions are always provided, even if the entered word does not occur in the training set.
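
As a rough illustration of the second feature, the sketch below shows one way the prediction context could be reset at a sentence end. The function name `last_context`, the regular expressions, and the lowercasing are assumptions made for this example, not the app's actual code.

```python
import re

# Assumption: the sentence-end tokens are ., ? and !, as listed above.
SENTENCE_END = re.compile(r"[.?!]")

def last_context(text, n=3):
    """Return up to the last n words typed since the most recent sentence end."""
    # Keep only the text after the last sentence-ending character, so a
    # finished sentence does not influence the next prediction.
    current_sentence = SENTENCE_END.split(text)[-1]
    # Hypothetical tokenizer: lowercase words made of letters and apostrophes.
    words = re.findall(r"[a-z']+", current_sentence.lower())
    return words[-n:]

print(last_context("I went home. How are you doing"))
print(last_context("See you soon!"))
```

When the text ends with a sentence-end token the context is empty, so a predictor would fall back to words that are likely regardless of context.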

Model Summary

  1. The data was cleaned and then divided into sentences.
  2. A random sample of 1 million sentences was used as the training data.
  3. These sentences were divided into 4-word tokens (quadgrams) and integer-encoded (each unique word was given a unique number) to conserve memory.
  4. Techniques to deal with unknown data
    • Unknown tokens introduced - Any word occurring just once in the training set was replaced with the <unk> placeholder to represent unseen words
    • Modified Kneser-Ney smoothing - The counts and probabilities were adjusted according to the modified Kneser-Ney smoothing recommended by Chen & Goodman (1998)
  5. The top 3 choices were computed for each of the following contexts:
    • 3-word tokens
    • 2-word tokens (in case the 3-word tokens do not match the input text)
    • 1-word tokens with 1 skipped word in the context (in case the 2-word tokens do not match)
    • 1-word tokens with 2 skipped words in the context (in case no matches are found above)
    • 1-word tokens with no context (in case no matches are found above)
  6. The resulting tables were saved for the app to use when predicting the next word, which keeps the calculations done on the app server to a minimum.
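
The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the app's code: it ranks next words by raw counts instead of modified Kneser-Ney probabilities, it leaves out the skipped-word contexts, and it fills any remaining slots from shorter contexts. The names `build_tables` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict

UNK = "<unk>"

def build_tables(sentences, top_k=3):
    """Build context -> top-k next-word tables from tokenized sentences."""
    # Step 4: words seen only once are replaced with <unk>.
    word_counts = Counter(w for s in sentences for w in s)
    clean = [[w if word_counts[w] > 1 else UNK for w in s] for s in sentences]

    # Step 3: map each unique word to an integer ID.
    vocab = {w: i for i, w in enumerate(sorted({w for s in clean for w in s}))}
    words = {i: w for w, i in vocab.items()}

    # Count next words for contexts of length 3, 2, 1 and 0.
    counts = defaultdict(Counter)
    for s in clean:
        ids = [vocab[w] for w in s]
        for pos, nxt in enumerate(ids):
            for n in range(4):
                if pos >= n:
                    counts[tuple(ids[pos - n:pos])][nxt] += 1

    # Step 6: keep only the top-k next words per context, never <unk>.
    # Assumption: raw counts stand in for the smoothed probabilities.
    unk_id = vocab.get(UNK)
    tables = {}
    for ctx, c in counts.items():
        ranked = [w for w, _ in c.most_common() if w != unk_id]
        tables[ctx] = ranked[:top_k]
    return vocab, words, tables

def predict(text_words, vocab, words, tables, top_k=3):
    """Step 5: back off from the longest context until top_k words are found."""
    unk_id = vocab.get(UNK)
    ids = [vocab.get(w, unk_id) for w in text_words[-3:]]
    suggestions = []
    for n in range(len(ids), -1, -1):
        ctx = tuple(ids[len(ids) - n:])
        for w in tables.get(ctx, []):
            if w not in suggestions:
                suggestions.append(w)
        if len(suggestions) >= top_k:
            break
    return [words[w] for w in suggestions[:top_k]]

# Tiny made-up corpus, for illustration only.
corpus = [s.split() for s in [
    "i love my dog", "i love my cat", "i love my dog",
    "you love my cat", "i love tea",
]]
vocab, words, tables = build_tables(corpus)
print(predict(["i", "love", "my"], vocab, words, tables))
```

Because every lookup is a dictionary access on a precomputed table, the prediction step does no counting or smoothing at request time, which is the point of step 6. The empty context always holds the most frequent words overall, so 3 suggestions come back even for unseen input, provided the vocabulary has at least 3 words besides <unk>.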

Model Metrics

On the test data set (2500 sentences):

  Final model: Modified Kneser-Ney backoff model
  Training sample size: 1 million sentences
  Vocabulary size: 103089
  Loss (per n-gram): 7.193934
  Perplexity (per n-gram): 146.4164
  Accuracy (%): 29.36
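
The loss and perplexity figures are related. The source does not state the base of the logarithm, but the two reported values are consistent with the loss being a cross-entropy in bits, so that perplexity = 2^loss. The short check below assumes this; the function name is hypothetical.

```python
def perplexity_from_bits(loss_bits):
    """Convert a per-n-gram cross-entropy in bits to perplexity."""
    # Assumption: the reported loss uses log base 2.
    return 2 ** loss_bits

reported_loss = 7.193934
reported_perplexity = 146.4164

computed = perplexity_from_bits(reported_loss)
print(computed, abs(computed - reported_perplexity) < 0.01)
```

A perplexity of about 146 means that, on average, the model is as uncertain as if it were choosing uniformly among roughly 146 words at each step.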

Glimpse of the app

How to use the app?

  1. Type in the black box
  2. Three words will be suggested in the navy blue box above the black box
  3. You can click on any of the suggested words and the black box will be updated accordingly OR
  4. You can keep typing and the suggestions will change accordingly.

Link to the app - Text-Ease

Link to the github repository with all the code - Text-Ease Github

License of the app and code - MIT License